# Methods and AlgorithmsDescription of analysis methods and explanation of the algorithms involved

#### View by tag

This document explains the concepts involved and how they are applied within the GATK (and Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.

### 1. Introducing the concept of parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).

Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.

This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.

The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.

Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.

Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.

#### Parallel computing in practice (sort of)

OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?

Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.

Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:

• open the file, count the number of lines in the file, tell us the number, close the file

Note that tell us the number can mean writing it to the console, or storing it somewhere for use later on.

Now let's say we want to know the number of words on each line. The set of instructions would be:

• open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number

And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.

So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:

• open the file, index the lines

• read the first line, count the number of words, tell us the number

• read the second line, count the number of words, tell us the number
• read the third line, count the number of words, tell us the number
• [repeat for all lines]

• collect final results and close the file

Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.

You'll notice that we added a step, index the lines. That's a little bit of peliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.

Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.

### 2. Parallelizing the GATK

There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand what are the different levels of computing that are involved.

#### A quick word about levels of computing

By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster.

• Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Latin terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.

• Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.

• Cluster: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.

Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.

In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.

If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.

Hey, if you have a better example, let us know in the forum and we'll use that instead.

Alright, now that you understand the idea of multithreading, let's get practical: how do we do get the GATK to use multi-threading?

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:

• -nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)

• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).

Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.

In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.

#### Scatter-gather

If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.

So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?

As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves a separate program, called Queue, which generates separate GATK jobs (each with its own command-line) to achieve the instructions given in a so-called Qscript (i.e. a script written for Queue in a programming language called Scala).

At the simplest level, the Qscript can involve a single GATK tool*. In that case Queue will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, Queue will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).

Note that Queue has additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation.

#### Compare and combine

So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But Queue can dispatch scattered GATK jobs to different machines in a computing cluster by interfacing with your cluster's job management software.

That being said, multithreading has the great advantage that cores and machines all have access to shared machine memory with very high bandwidth capacity. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network costs.

The good news is that you can combine scatter-gather and multithreading: use Queue to scatter GATK jobs to different nodes on your cluster, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.

Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.

### IMPORTANT NOTE: This document is out of date and will be replaced soon. In the meantime, you can find accurate information on how to run SnpEff in a compatible way with GATK in the SnpEff documentation, and instructions on what steps are necessary in the presentation on Functional Annotation linked in the comments below.

Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.

### Introduction

Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our lack of ability to annotate indels has led us to employ the use of a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually-updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.

### SnpEff Setup and Usage

Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.

After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:

database_repository = http://sourceforge.net/projects/snpeff/files/databases/


java -jar snpEff.jar download GRCh37.64


You can find a list of available databases here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.

The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the data_dir entry in the file snpEff.config to point to the correct directory.

Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.

### Supported SnpEff Versions

If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version **2.0.5*. Note: newer versions are currently unsupported by the GATK, as we haven't yet had the reources to test it.

### Current Recommended Best Practices When Running SnpEff

These best practices are based on our analysis of various snpEff/database versions as described in detail in the Analysis of SnpEff Annotations Across Versions section below.

• We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however as we have not yet tested the fixed version of the database we continue to recommend using only GRCh37.64 for now.

• We recommend always running with -onlyCoding true with human databases (eg., the GRCh37.* databases). Setting -onlyCoding false causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The -onlyCoding false option should only be used with databases that lack protein coding information.

• Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.

### Analyses of SnpEff Annotations Across Versions

See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions here.

• Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.

• snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.

See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 here

• The GRCh37.64 database gives good results on the condition that you run snpEff with the -onlyCoding true option. The -onlyCoding false option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.

• The GRCh37.65 database gives results that are as poor as those you get with the -onlyCoding false option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.

See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions here

• snpEff's indel annotations are highly concordant with those of a high-quality set of genomic annotations from the 1000 Genomes project. This is true across all snpEff/database versions tested.

### Example SnpEff Usage with a VCF Input File

Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the data_dir as defined in snpEff.config.

java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf


In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:

EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.


And here is the precise layout with all the subfields:

EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...


It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.

### Adding SnpEff Annotations using VariantAnnotator

Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.

There are two different options for doing this:

#### Option 1: Annotate with only the highest-impact effect for each variant

NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.

The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.

When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:

• Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.

• Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.

• Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.

Example Usage:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-A SnpEff \
--variant 1000G.exomes.vcf \        (file to annotate)
--snpEffFile snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf


VariantAnnotator adds some or all of the following INFO field annotations to each variant record:

• SNPEFF_EFFECT - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)
• SNPEFF_IMPACT - Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER)
• SNPEFF_FUNCTIONAL_CLASS - Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE)
• SNPEFF_CODON_CHANGE - Old/New codon for the highest-impact effect resulting from the current variant
• SNPEFF_AMINO_ACID_CHANGE - Old/New amino acid for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_NAME - Gene name for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_BIOTYPE - Gene biotype for the highest-impact effect resulting from the current variant
• SNPEFF_TRANSCRIPT_ID - Transcript ID for the highest-impact effect resulting from the current variant
• SNPEFF_EXON_ID - Exon ID for the highest-impact effect resulting from the current variant

Example VCF records annotated using SnpEff and VariantAnnotator:

1   874779  .   C   T   279.94  . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066

1   874816  .   C   CT  2527.52 .   AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066


#### Option 2: Annotate with all effects for each variant

VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.

Here's an example of how to do this:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-E resource.EFF \
--variant 1000G.exomes.vcf \      (file to annotate)
--resource snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf


Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.

### List of Genomic Effects

Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.

#### High-Impact Effects

• SPLICE_SITE_ACCEPTOR
• SPLICE_SITE_DONOR
• START_LOST
• EXON_DELETED
• FRAME_SHIFT
• STOP_GAINED
• STOP_LOST

#### Moderate-Impact Effects

• NON_SYNONYMOUS_CODING
• CODON_CHANGE (note: this effect is used by SnpEff only for MNPs, not SNPs)
• CODON_INSERTION
• CODON_CHANGE_PLUS_CODON_INSERTION
• CODON_DELETION
• CODON_CHANGE_PLUS_CODON_DELETION
• UTR_5_DELETED
• UTR_3_DELETED

#### Low-Impact Effects

• SYNONYMOUS_START
• NON_SYNONYMOUS_START
• START_GAINED
• SYNONYMOUS_CODING
• SYNONYMOUS_STOP
• NON_SYNONYMOUS_STOP

#### Modifiers

• NONE
• CHROMOSOME
• CUSTOM
• CDS
• GENE
• TRANSCRIPT
• EXON
• INTRON_CONSERVED
• UTR_5_PRIME
• UTR_3_PRIME
• DOWNSTREAM
• INTRAGENIC
• INTERGENIC
• INTERGENIC_CONSERVED
• UPSTREAM
• REGULATION
• INTRON

### Functional Classes

SnpEff assigns a functional class to certain effects, in addition to an impact:

• NONSENSE: assigned to point mutations that result in the creation of a new stop codon
• MISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codon
• SILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon
• NONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)

The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.

# Sting BWA/C Bindings

WARNING: This tool is experimental and unsupported and just starting to be developed and used and should be considered a beta version. Feel free to report bugs but we are not supporting the tool

The GSA group has made bindings available for Heng Li's Burrows-Wheeler Aligner (BWA). Our aligner bindings present additional functionality to the user not traditionally available with BWA. BWA standalone is optimized to do fast, low-memory alignments from Fastq to BAM. While our bindings aim to provide support for reasonably fast, reasonably low memory alignment, we add the capacity to do exploratory data analyses. The bindings can provide all alignments for a given read, allowing a user to walk over the alignments and see information not typically provided in the BAM format. Users of the bindings can 'go deep', selectively relaxing alignment parameters one read at a time, looking for the best alignments at a site.

The BWA/C bindings should be thought of as alpha release quality. However, we aim to be particularly responsive to issues in the bindings as they arise. Because of the bindings' alpha state, some functionality is limited; see the Limitations section below for more details on what features are currently supported.

## A note about using the bindings

Whenever native code is called from Java, the user must assist Java in finding the proper shared library. Java looks for shared libraries in two places, on the system-wide library search path and through Java properties invoked on the command line. To add libbwa.so to the global library search path, add the following to your .my.bashrc, .my.cshrc, or other startup file:

export LD_LIBRARY_PATH=/humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH  ##### csh setenv LD_LIBRARY_PATH /humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH


To specify the location of libbwa.so directly on the command-line, use the java.library.path system property as follows:

java -Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T AlignmentValidation \
-I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta


## Preparing to use the aligner

We provide internally accessible versions of both the BWA shared library and precomputed BWA indices for two commonly used human references at the Broad (Homo_sapiens_assembly18.fasta and human_b36_both.fasta). These files live in the following directory:

/humgen/gsa-scr1/GATK_Data/bwa/stable


### Outside of the Broad Institute

Two steps are required in preparing to use the aligner: building the shared library and using BWA/C to generate an index of the reference sequence.

The Java bindings to the aligner are available through the Sting repository. A precompiled version of the bindings are available for Linux; these bindings are available in c/bwa/libbwa.so.1. To build the aligner from source:

• Fetch the latest svn of BWA from SourceForge. Configure and build BWA.
sh autogen.sh
./configure
make

• Customize the variables at the top one of the build scripts (c/bwa/build_linux.sh,c/bwa/build_mac.sh) based on your environment. Run the build script.

To build a reference sequence, use the BWA C executable directly:

bwa index -a bwtsw <your reference sequence>.fasta


## Using the existing GATK alignment walkers

Two walkers are provided for end users of the GATK. The first of the stock walkers is Align, which can align an unmapped BAM file or realign a mapped BAM file.

java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T Align \
-I NA12878_Pilot1_20.unmapped.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta \
-U \
-ob human.unsorted.bam


Most of the available parameters here are standard GATK. -T specifies that the alignment analysis should be used; -I specifies the unmapped BAM file to align, and the -R specifies the reference to which to align. By default, this walker assumes that the bwa index support files will live alongside the reference. If these files are stored elsewhere, the optional -BWT argument can be used to specify their location. By defaults, alignments will be emitted to the console in SAM format. Alignments can be spooled to disk in SAM format using the -o option or spooled to disk in BAM format using the -ob option.

The other stock walker is AlignmentValidation, which computes all possible alignments based on the BWA default configuration settings and makes sure at least one of the top alignments matches the alignment stored in the read.

java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T AlignmentValidation \
-I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta


Options for the AlignmentValidation walker are identical to the Alignment walker, except the AlignmentValidation walker's only output is a exception if validation fails.

Another sample walker of limited scope, CountBestAlignmentsWalker, is available for review; it is discussed in the example section below.

## Writing new GATK walkers utilizing alignment bindings

BWA/C can be created on-the-fly using the org.broadinstitute.sting.alignment.bwa.c.BWACAligner constructor. The bindings have two sets of interfaces: an interface which returns all possible alignments and an interface which randomly selects an alignment from a list of the top scoring alignments as selected by BWA.

To iterate through all functions, use the following method:

    /**
* Get a iterator of alignments, batched by mapping quality.
* @param bases List of bases.
* @return Iterator to alignments.
*/
public Iterable<Alignment[]> getAllAlignments(final byte[] bases);


The call will return an Iterable which batches alignments by score. Each call to next() on the provided iterator will return all Alignments of a given score, ordered in best to worst. For example, given a read sequence with at least one match on the genome, the first call to next() will supply all exact matches, and subsequent calls to next() will give alignments judged to be inferior by BWA (alignments containing mismatches, gap opens, or gap extensions).

Alignments can be transformed to reads using the following static method in org.broadinstitute.sting.alignment.Alignment:

    /**
* Creates a read directly from an alignment.
* @param alignment The alignment to convert to a read.
* @param unmappedRead Source of the unmapped read.  Should have bases, quality scores, and flags.
* @param newSAMHeader The new SAM header to use in creating this read.  Can be null, but if so, the sequence
*                     dictionary in the
* @return A mapped alignment.
*/


A convenience method is available which allows the user to get SAMRecords directly from the aligner.

    /**
* Get a iterator of aligned reads, batched by mapping quality.
* @param newHeader Optional new header to use when aligning the read.  If present, it must be null.
* @return Iterator to alignments.
*/


To return a single read randomly selected by the bindings, use one of the following methods:

    /**
* Allow the aligner to choose one alignment randomly from the pile of best alignments.
* @param bases Bases to align.
* @return An align
*/
public Alignment getBestAlignment(final byte[] bases);

/**
* Align the read to the reference.
* @return A list of the alignments.
*/


The org.broadinstitute.sting.alignment.bwa.BWAConfiguration argument allows the user to specify parameters normally specified to 'bwt aln'. Available parameters are:

• Maximum edit distance (-n)
• Maximum gap opens (-o)
• Maximum gap extensions (-e)
• Disallow an indel within INT bp towards the ends (-i)
• Mismatch penalty (-M)
• Gap open penalty (-O)
• Gap extension penalty (-E)

Settings must be supplied to the constructor; leaving any BWAConfiguration field unset means that BWA should use its default value for that argument. Configuration settings can be updated at any time using the BWACAligner updateConfiguration method.

    public void updateConfiguration(BWAConfiguration configuration);


## Running the aligner outside of the GATK

The BWA/C bindings were written with running outside of the GATK in mind, but this workflow has never been tested. If you would like to run the bindings outside of the GATK, you will need:

• The BWA shared object, libbwa.so.1
• The packaged version of Aligner.jar

To build the packaged version of the aligner, run the following command

cp STING_HOME/lib/bcel-*.jar ~/.ant/lib ant package -Dexecutable=Aligner  This command will extract all classes required to run the aligner and place them inSTING_HOME/dist/packages/Aligner/Aligner.jar. You can then specify this one jar in your project's dependencies.

## Limitations

The BWA/C bindings are currently in an alpha state, but they are extensively supported. Because of the bindings' alpha state, some functionality is limited. The limitations of these bindings include:

• Only single-end alignment is supported. However, a paired end module could be implemented as a simple extension that finds the jointly optimal placement of both singly-aligned ends.
• Color space alignments are not currently supported.
• Only a limited number of parameters BWA's extensive parameter list are supported. The current list of supported parameters is specified in the 'Writing new GATK walkers utilizing alignment bindings' section below.
• The system is not as heavily memory-optimized as the BWA/C implementation standalone. The JVM, by default, uses slightly over 4G of resident memory when running BWA on human. We have not done extensive testing on the behavior of the BWA/C bindings under memory pressure.
• There is a slight negative impact on performance when using the BWA/C bindings. BWA/C standalone on 6.9M reads of human data takes roughly 45min to run 'bwa aln', 5min to run 'bwa samse', and another 1.5min to convert the resulting SAM file to a BAM. Aligning the same dataset using the Java bindings takes approximately 55 minutes.
• The GATK requires that its input BAMs be sorted and indexed. Before using the Align or AlignmentValidation walker, you must sort and index your unmapped BAM file. Note that this is a limitation of the GATK, not the aligner itself. Using the alignment support files outside of the GATK will eliminate this requirement.

## Example: analysis of alignments with the BWA bindings

In order to validate that the Java bindings were computing the same number of reads as BWA/C standalone, we modified the BWA source to gather the number of equally scoring alignments and the frequency of the number of equally scoring alignments. We then implemented the same using a walker written in the GATK. We computed this distribution over a set of 36bp human reads and found the distributions to be identical.

The relevant parts of the walker follow.

public class CountBestAlignmentsWalker extends ReadWalker<Integer,Integer> {
/**
* The supporting BWT index generated using BWT.
*/
@Argument(fullName="BWTPrefix",shortName="BWT",doc="Index files generated by bwa index -d bwtsw",required=false)
String prefix = null;

/**
* The actual aligner.
*/
private Aligner aligner = null;

private SortedMap<Integer,Integer> alignmentFrequencies = new TreeMap<Integer,Integer>();

/**
* Create an aligner object.  The aligner object will load and hold the BWT until close() is called.
*/
@Override
public void initialize() {
BWTFiles bwtFiles = new BWTFiles(prefix);
BWAConfiguration configuration = new BWAConfiguration();
aligner = new BWACAligner(bwtFiles,configuration);
}

/**
* Aligns a read to the given reference.
* @param ref Reference over the read.  Read will most likely be unmapped, so ref will be null.
* @return Number of alignments found for this read.
*/
@Override
public Integer map(char[] ref, SAMRecord read) {
if(alignmentIterator.hasNext()) {
int numAlignments = alignmentIterator.next().length;
if(alignmentFrequencies.containsKey(numAlignments))
alignmentFrequencies.put(numAlignments,alignmentFrequencies.get(numAlignments)+1);
else
alignmentFrequencies.put(numAlignments,1);
}
return 1;
}

/**
* Initial value for reduce.  In this case, validated reads will be counted.
* @return 0, indicating no reads yet validated.
*/
@Override
public Integer reduceInit() { return 0; }

/**
* Calculates the number of reads processed.
* @param value Number of reads processed by this map.
* @param sum Number of reads processed before this map.
* @return Number of reads processed up to and including this map.
*/
@Override
public Integer reduce(Integer value, Integer sum) {
return value + sum;
}

/**
* Cleanup.
* @param result Number of reads processed.
*/
@Override
public void onTraversalDone(Integer result) {
aligner.close();
for(Map.Entry<Integer,Integer> alignmentFrequency: alignmentFrequencies.entrySet())
out.printf("%d\t%d%n", alignmentFrequency.getKey(), alignmentFrequency.getValue());
super.onTraversalDone(result);
}
}


This walker can be run within the svn version of the GATK using -T CountBestAlignments.

The resulting placement count frequency is shown in the graph below. The number of placements clearly follows an exponential.

## Validation methods

Two major techniques were used to validate the Java bindings against the current BWA implementation.

• Fastq files from E coli and from NA12878 chr20 were aligned using BWA standalone with BWA's default settings. The aligned SAM files were sorted, indexed, and fed into the alignment validation walker. The alignment validation walker verified that one of the top scoring matches from the BWA bindings matched the alignment produced by BWA standalone.
• Fastq files from E coli and from NA12878 chr20 were aligned using the GATK Align walker, then fed back into the GATK's alignment validation walker.
• The distribution of the alignment frequency was compared between BWA standalone and the Java bindings and was found to be identical.

As an ongoing validation strategy, we will use the GATK integration test suite to align a small unmapped BAM file with human data. The contents of the unmapped BAM file will be aligned and written to disk. The md5 of the resulting file will be calculated and compared to a known good md5.

## Unsupported: using the BWA/C bindings from within Matlab

Some users are attempting to use the BWA/C bindings from within Matlab. To run the GATK within Matlab, you'll need to add libbwa.so to your library path through the librarypath.txt file. The librarypath.txt file normally lives in $matlabroot/toolbox/local. Within the Broad Institute, the$matlabroot/toolbox/local/librarypath.txt file is shared; therefore, you'll have to create a librarypath.txt file in your working directory from which you execute matlab.

##
## FILE: librarypath.txt
##
## Entries:
##    o path_to_jnifile
##    o [alpha,glnx86,sol2,unix,win32,mac]=path_to_jnifile
##    o $matlabroot/path_to_jnifile ## o$jre_home/path_to_jnifile
##
$matlabroot/bin/$arch
/humgen/gsa-scr1/GATK_Data/bwa/stable


Once you've edited the library path, you can verify that Matlab has picked up your modified file by running the following command:

>> java.lang.System.getProperty('java.library.path')

ans =


Once the location of libbwa.so has been added to the library path, you can use the BWACAligner just as you would any other Java class in Matlab:

>> javaclasspath({'/humgen/gsa-scr1/hanna/src/Sting/dist/packages/Aligner/Aligner.jar'})
>> x = BWACAligner(BWTFiles('/humgen/gsa-scr1/GATK_Data/bwa/Homo_sapiens_assembly18.fasta'),BWAConfiguration())
>> y=x.getAllAlignments(uint8('CCAATAACCAAGGCTGTTAGGTATTTTATCAGCAATGTGGGATAAGCAC'));


We don't have the resources to directly support using the BWA/C bindings from within Matlab, but if you report problems to us, we will try to address them.

Detailed information about command line options for BaseRecalibrator can be found here.

## Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.

The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.

## Running the tools

### BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

• assigned quality score
• machine cycle producing this base
• current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.

### Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:


java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-I input.bam \
-BQSR recalibration_report.grp \
-o output.bam


After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.

Effectively the new quality score is:

• the sum of the global difference between reported quality scores and the empirical quality
• plus the quality bin specific shift
• plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

### Miscellaneous information

• The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
• A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
• Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real snps, you should always perform recalibration on your bam file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
• The recalibrator applies a "yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems, which has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.

## Example pre and post recalibration results

• Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
• There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

## The output of the BaseRecalibrator

• A Recalibration report containing all the recalibration information for the data

Note that the BasRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

### The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

• Arguments Table -- a table with all the arguments and its values
• Quantization Table
• Quality Score Table
• Covariates Table

#### Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:


#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...


#### Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.

Example Arguments table:


#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...


This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.


#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762


#### Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.


#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...


#### Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.


#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...


## Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

• First do an initial round of SNP calling on your original, unrecalibrated data.
• Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
• Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

### Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

### Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

### Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.

### The workflow

#### 1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir STAR --runMode genomeGenerate --genomeDir$genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>


2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir cd$runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>  3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass: genomeDir=/path/to/hg19_2pass mkdir$genomeDir
STAR --runMode genomeGenerate --genomeDir genomeDir --genomeFastaFiles hg19.fa \ --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>  4) The resulting index is then used to produce the final alignments as follows: runDir=/path/to/2pass mkdirrunDir
cd $runDir STAR --genomeDir$genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>


The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

java -jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics


#### 3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS


#### 4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

#### 5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

#### 6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf


#### 7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf


Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

### Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.

[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above.

### Overview and motivations

This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.

This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

### Workflow details

#### 1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the following options:

--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

#### 2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.

#### 3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

#### 4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

### ReorderSam

The GATK can be particular about the ordering of a BAM file. If you find yourself in the not uncommon situation of having created or received BAM files sorted in a bad order, you can use the tool ReorderSam to generate a new BAM file where the reads have been reordered to match a well-ordered reference file.

java -jar picard/ReorderSam.jar I= lexicographc.bam O= kayrotypic.bam REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta


This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. This tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a lift over tool!

The tool, though once in the GATK, is now part of the Picard package.

### This utility replaces read groups in a BAM file

It is useful for fixing problems such as not having read groups in a bam file.

java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo


Note that this tool is now part of the Picard package: http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups

This tool can fix BAM files without read group information:

# throws an error
java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I testdata/exampleNORG.bam -T UnifiedGenotyper

java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo CREATE_INDEX=True

# runs without error
java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I exampleNewRG.bam -T UnifiedGenotyper


### Creating Amplicon Sequences

Note that earlier versions of the GATK used a different tool.

For a complete, detailed argument reference, refer to the GATK document page here

## Introduction

This tool generates amplicon sequences for use with the Sequenom primer design tool. The output of this tool is fasta-formatted, where the characters [A/B] specify the allele to be probed (see Validation Amplicons Output further below). It can mask nearby variation (either by 'N' or by lower-casing characters), and can try to restrict sequenom design to regions of the amplicon likely to generate a highly specific primer. This tool will also flag sites with properties that could shift the mass-spec peak from its expected value, such as indels in the amplicon sequence, SNPs within 4 bases of the variant attempting to be probed, or multiple variants selected for validation falling into the same amplicon.

#### Lowercase and Ns

Ns in the amplicon sequence instructs primer design software (such as Sequenom) not to use that base in the primer: any primer will fall entirely before, or entirely after, that base. Lower-case letters instruct the design software to try to avoid using the base (presumably by applying a penalty for doing so), but will not prevent it from doing so if a good primer (i.e. a primer with suitable melting temperature and low probability of hairpin formation) is found.

#### BWA Bindings

ValidationAmplicons relies on the GATK Sting BWA/C bindings to assess the specificity of potential primers. The wiki page for Sting BWA/C bindings contains required information about how to download the appropriate version of BWA, how to create a BWT reference, and how to set your classpath appropriately to run this tool. If you have not followed the directions to set up the BWA/C bindings, you will not be able to create validation amplicon sequences using the GATK. There is an argument (see below) to disable the use of BWA, and lower repeats within the amplicon only. Use of this argument is not recommended.

## Running Validation Amplicons

Validation Amplicons requires three input files: a VCF of alleles you want to validate, a VCF of variants you want to mask, and a Table of intervals around the variants describing the size of the amplicons. For instance:

Alleles to Validate

##fileformat=VCFv4.0
#CHROM      POS     ID      REF     ALT     QUAL    FILTER  INFO
20          207414  .       G    A       85.09   PASS    .   // SNP to validate
20          792122  .    TCCC    T       22.24   PASS    .   // DEL to validate
20          994145  .       G    GAAG    48.21   PASS    .   // INS to validate
20          1074230 .       C    T       2.29     QD     .   // SNP to validate (but filtered)
20          1084330 .      AC   GT       42.21   PASS    .   // MNP to validate


Interval Table

HEADERpos               name
20:207334-207494        20_207414
20:792042-792202        20_792122
20:994065-994225        20_994145
20:1074150-1074310      20_1074230
20:1084250-1084410      20_1084330


##fileformat=VCFv4.1
#CHROM  POS      ID        REF         ALT     QUAL       FILTER         INFO
20      207414   .         G           A       77.12      PASS            .
20      207416   .         A           AGGC    49422.34   PASS            .
20      792076   .         A           G       2637.15    HaplotypeScore  .
20      792080   .         T           G       161.83     PASS            .
20      792087   .         CGGT        C       179.84     ReadPosRankSum  .
20      792106   .         C           G       32.59      PASS            .
20      792140   .         C           G       409.75     PASS            .
20      1084319  .         T          A,C      22.24      PASS            .
20      1084348  .  TACCACCCCACACA     T      482.84      PASS            .


## Validation Amplicons Output

The output from Validation Amplicons is a fasta-formatted file, with a small adaptation to represent the site being probed. Using the test files above, the output of the command

java -jar GATK/dist/GenomeAnalysisTK.jar \ -T ValidationAmplicons \ -R /humgen/1kg/reference/human_g1k_v37.fasta \ -BTI ProbeIntervals \ --ProbeIntervals:table interval_table.table \ --ValidateAlleles:vcf sites_to_validate.vcf \ --MaskAlleles:vcf mask_sites.vcf \ --virtualPrimerSize 30 \ -o probes.fasta \ -l WARN  is >20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414 CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC >20:792122 Valid 20_792122 TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT >20:994145 Valid 20_994145 TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC >20:1074230 SITE_IS_FILTERED=1, 20_1074230 ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC >20:1084330 DELETION=1, 20_1084330 CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC  Note that SNPs have been masked with 'N's, filtered 'mask' variants do not appear, the insertion has been flanked by Ns, the unfiltered deletion has been replaced by Ns, and the filtered site in the validation VCF is not marked as valid. In addition, bases that fall inside at least one non-unique 30-mer (meaning no multiple MQ0 alignments using BWA) are lower-cased. The identifier for each sequence is the position of the allele to be probed, a 'validation status' (defined below), and a string representing the amplicon. Validation status values are: Valid // amplicon is valid SITE_IS_FILTERED=1 // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant") VARIANT_TOO_NEAR_PROBE=1 // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak MULTIPLE_PROBES=1, // multiple variants to be validated found inside the same amplicon DELETION=6,INSERTION=5, // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate DELETION=1, // deletion found inside the amplicon region, could shift mass-spec peak START_TOO_CLOSE, // variant is too close to the start of the amplicon region to give sequenom a good chance to find a suitable primer END_TOO_CLOSE, // variant is too close to the end of the amplicon region to give sequenom a good chance to find a suitable primer NO_VARIANTS_FOUND, // no variants found within the amplicon region INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. insertion directly preceding or postceding, or a deletion that spans the site itself)  ## Warnings During Traversal The files provided to Validation Amplicons should be such that all generated amplicons are valid. That means: There are no variants within 4bp of the site to be validated There are no indels in the amplicon region Amplicon windows do not include other sites to be probed Amplicon windows are not too short, and the variant therein is not within 50bp of either edge All amplicon windows contain a variant to be validated Variants to be validated are unfiltered or pass filters  The tool will warn you each time any of these conditions are not met. ## Contents ## Introduction ValidationSiteSelectorWalker is intended for use in experiments where we sample data randomly from a set of variants, for example in order to choose sites for a follow-up validation study. Sites are selected randomly but within certain restrictions. There are two main sources of restrictions: Sample restrictions and Frequency restrictions. Sample restrictions alter the polymorphic/monomorphic status of sites by restricting the sample set to a given number of samples. Frequency restrictions bias the site sampling method to sample either uniformly, or in accordance with the allele frequency spectrum of the input VCF. ## GATK Documentation For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector. ## Sample and Frequency Restrictions ### -sampleMode The -sampleMode argument controls the mode of sample-based site consideration. The options are: • None: All sites are included for consideration, including reference sites • Poly_based_on_gt: Site is included if it has a variant genotype in at least one of the selected samples • Poly_based_on_gl: Site is included if it is likely to be variant based on the genotype likelihoods of the selected samples ### -samplePNonref Note that Poly_based_on_gl uses the exact allele frequency calculation model to estimate P[site is nonref]. The site is considered for validation if P[site is nonref] > [this argument]. So if you want to validate sites that are >95% confidently nonref (based on the likelihoods), you would set -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95 ### -frequencySelectionMode The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are: • Uniform: Choose variants uniformly, without regard to their allele frequency. • Keep AF Spectrum: Choose variants so that the resulting allele frequency matches as closely as possible to that of the input VCF. ### Please note that the DataProcessingPipeline qscript is no longer available. The DPP script was only provided has an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what are the processing steps and what parameters you need to set/adjust. ## Data Processing Pipeline The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK. ### Introduction Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced). ### Requirements This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package. Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs. ### Command-line arguments #### Required Parameters Argument (short-name) Argument (long-name) Description -i <BAM file / BAM list> --input <BAM file / BAM list> input BAM file - or list of BAM files. -R <fasta> --reference <fasta> Reference fasta file. -D <vcf> --dbsnp <dbsnp vcf> dbsnp ROD to use (must be in VCF format). #### Optional Parameters Argument (short-name) Argument (long-name) Description -indels <vcf> --extra_indels <vcf> VCF files to use as reference indels for Indel Realignment. -bwa <path> --path_to_bwa <path> The path to the binary of bwa (usually BAM files have already been mapped - but if you want to remap this is the option) -outputDir <path> --output_directory <path> Output path for the processed BAM files. -L <GATK interval string> --gatk_interval_string <GATK interval string> the -L interval string to be used by GATK - output bams at interval only -intervals <GATK interval file> --gatk_interval_file <GATK interval file> an intervals file to be used by GATK - output bams at intervals #### Modes of Operation (also optional parameters) Argument (short-name) Argument (long-name) Description -p <name> --project <name> the project name determines the final output (BAM file) base name. Example NA12878 yields NA12878.processed.bam -knowns --knowns_only Perform cleaning on knowns only. -sw --use_smith_waterman Perform cleaning using Smith Waterman -bwase --use_bwa_single_ended Decompose input BAM file and fully realign it using BWA and assume Single Ended reads -bwape --use_bwa_pair_ended Decompose input BAM file and fully realign it using BWA and assume Pair Ended reads ## The Pipeline Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis read reads (bam file): Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline: ### BWA alignment This option is for datasets that have already been processed using a different pipeline or different criteria, and you want to reprocess it using this pipeline. One example is a BAM file that has been processed at the lane level, or did not perform some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample level bams and entering the pipeline. ### Sample Level Processing This is the where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration. #### Indel Realignment This is a two step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional smith waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned. #### Base Quality Score Recalibration This is a crucial step that re-adjusts the quality score using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, Dinucleotide Covariate ### The Outputs The Data Processing Pipeline produces 3 types of output for each file: a fully processed bam file, a validation report on the input bam and output bam files, a analysis before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output. #### Processed Bam File The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline. #### Validation Files We validate each unprocessed sample level BAM file and each final processed sample level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM has to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation file is named : <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation. Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on how your BAM file is looking. Some errors are not critical, but the output files (both pre.validation and post.validation) should give you some input on how to make your dataset better organized in the BAM format. #### Base Quality Score Recalibration Analysis PDF plots of the base qualities are generated before and after recalibration for further analysis on the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named : <project name>.<sample name>.pre and <project name>.<sample name>.post. ### Examples 1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa) java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/GATK/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -p myFancyProjectName \ -i myDataSet.list \ -R reference.fasta \ -D dbSNP.vcf \ -run 2. Performing realignment and the full data processing pipeline in one pair-ended bam file java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -bwa path/to/bwa \ -i test.bam \ -R reference.fasta \ -D dbSNP.vcf \ -p myProjectWithRealignment \ -bwape \ -run ### Overview This document describes the purpose and general principles of the Genotype Refinement workflow. For the mathematical details of the methods involved, please see the Genotype Refinement math documentation. For step-by-step instructions on how to apply this workflow to your data, please see the Genotype Refinement tutorial. ## 1. Introduction The core GATK Best Practices workflow has historically focused on variant discovery --that is, the existence of genomic variants in one or more samples in a cohorts-- and consistently delivers high quality results when applied appropriately. However, we know that the quality of the individual genotype calls coming out of the variant callers can vary widely based on the quality of the BAM data for each sample. The goal of the Genotype Refinement workflow is to use additional data to improve the accuracy of genotype calls and to filter genotype calls that are not reliable enough for downstream analysis. In this sense it serves as an optional extension of the variant calling workflow, intended for researchers whose work requires high-quality identification of individual genotypes. A few commonly asked questions are: ### What studies can benefit from the Genotype Refinement workflow? While every study can benefit from increased data accuracy, this workflow is especially useful for analyses that are concerned with how many copies of each variant an individual has (e.g. in the case of loss of function) or with the transmission (or de novo origin) of a variant in a family. ### What additional data do I need to run the Genotype Refinement workflow? If a “gold standard” dataset for SNPs is available, that can be used as a very powerful set of priors on the genotype likelihoods in your data. For analyses involving families, a pedigree file describing the relatedness of the trios in your study will provide another source of supplemental information. If neither of these applies to your data, the samples in the dataset itself can provide some degree of genotype refinement (see section 5 below for details). ### Is the Genotype Refinement workflow going to change my data? Can I still use my old analysis pipeline? After running the Genotype Refinement workflow, several new annotations will be added to the INFO and FORMAT fields of your variants (see below), GQ fields will be updated, and genotype calls may be modified. However, the Phred-scaled genotype likelihoods (PLs) which indicate the original genotype call (the genotype candidate with PL=0) will remain untouched. Any analysis that made use of the PLs will produce the same results as before. ## 2. The Genotype Refinement workflow ### Overview ### Input Begin with recalibrated variants from VQSR at the end of the best practices pipeline. The filters applied by VQSR will be carried through the Genotype Refinement workflow. ### Step 1: Derive posterior probabilities of genotypes #### Tool used: CalculateGenotypePosteriors Using the Phred-scaled genotype likelihoods (PLs) for each sample, prior probabilities for a sample taking on a HomRef, Het, or HomVar genotype are applied to derive the posterior probabilities of the sample taking on each of those genotypes. A sample’s PLs were calculated by HaplotypeCaller using only the reads for that sample. By introducing additional data like the allele counts from the 1000 Genomes project and the PLs for other individuals in the sample’s pedigree trio, those estimates of genotype likelihood can be improved based on what is known about the variation of other individuals. SNP calls from the 1000 Genomes project capture the vast majority of variation across most human populations and can provide very strong priors in many cases. At sites where most of the 1000 Genomes samples are homozygous variant with respect to the reference genome, the probability of a sample being analyzed of also being homozygous variant is very high. For a sample for which both parent genotypes are available, the child’s genotype can be supported or invalidated by the parents’ genotypes based on Mendel’s laws of allele transmission. Even the confidence of the parents’ genotypes can be recalibrated, such as in cases where the genotypes output by HaplotypeCaller are apparent Mendelian violations. ### Step 2: Filter low quality genotypes #### Tool used: VariantFiltration After the posterior probabilities are calculated for each sample at each variant site, genotypes with GQ < 20 based on the posteriors are filtered out. GQ20 is widely accepted as a good threshold for genotype accuracy, indicating that there is a 99% chance that the genotype in question is correct. Tagging those low quality genotypes indicates to researchers that these genotypes may not be suitable for downstream analysis. However, as with the VQSR, a filter tag is applied, but the data is not removed from the VCF. ### Step 3: Annotate possible de novo mutations #### Tool used: VariantAnnotator Using the posterior genotype probabilities, possible de novo mutations are tagged. Low confidence de novos have child GQ >= 10 and AC < 4 or AF < 0.1%, whichever is more stringent for the number of samples in the dataset. High confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion. ### Step 4: Functional annotation of possible biological effects #### Tool used: SnpEff (non-GATK) Especially in the case of de novo mutation detection, analysis can benefit from the functional annotation of variants to restrict variants to exons and surrounding regulatory regions. The GATK currently does not feature integration with any functional annotation tool, but SnpEff is a useful utility that operates in a way very similar to the GATK and integrates readily with the GATK VCF output. ## 3. Output annotations The Genotype Refinement Pipeline adds several new info- and format-level annotations to each variant. GQ fields will be updated, and genotypes calculated to be highly likely to be incorrect will be changed. The Phred-scaled genotype likelihoods (PLs) carry through the pipeline without being changed. In this way, PLs can be used to derive the original genotypes in cases where sample genotypes were changed. ### Population Priors New INFO field annotation PG is a vector of the Phred-scaled prior probabilities of a sample at that site being HomRef, Het, and HomVar. These priors are based on the input samples themselves along with data from the supporting samples if the variant in question overlaps another in the supporting dataset. ### Phred-Scaled Posterior Probability New FORMAT field annotation PP is the Phred-scaled posterior probability of the sample taking on each genotype for the given variant context alleles. The PPs represent a better calibrated estimate of genotype probabilities than the PLs are recommended for use in further analyses instead of the PLs. ### Genotype Quality Current FORMAT field annotation GQ is updated based on the PPs. The calculation is the same as for GQ based on PLs. ### Joint Trio Likelihood New FORMAT field annotation JL is the Phred-scaled joint likelihood of the posterior genotypes for the trio being incorrect. This calculation is based on the PLs produced by HaplotypeCaller (before application of priors), but the genotypes used come from the posteriors. The goal of this annotation is to be used in combination with JP to evaluate the improvement in the overall confidence in the trio’s genotypes after applying CalculateGenotypePosteriors. The calculation of the joint likelihood is given as: $$-10*\log ( 1-GL_{mother}[\text{Posterior mother GT}] * GL_{father}[\text{Posterior father GT}] * GL_{child}[\text{Posterior child GT}] )$$ where the GLs are the genotype likelihoods in [0, 1] probability space. ### Joint Trio Posterior New FORMAT field annotation JP is the Phred-scaled posterior probability of the output posterior genotypes for the three samples being incorrect. The calculation of the joint posterior is given as: $$-10*\log (1-GP_{mother}[\text{Posterior mother GT}] * GP_{father}[\text{Posterior father GT}] * GP_{child}[\text{Posterior child GT}] )$$ where the GPs are the genotype posteriors in [0, 1] probability space. ### Low Genotype Quality New FORMAT field filter lowGQ indicates samples with posterior GQ less than 20. Filtered samples tagged with lowGQ are not recommended for use in downstream analyses. ### High and Low Confidence De Novo New INFO field annotation for sites at which at least one family has a possible de novo mutation. Following the annotation tag is a list of the children with de novo mutations. High and low confidence are output separately. ## 4. Example Before: 1 1226231 rs13306638 G A 167563.16 PASS AC=2;AF=0.333;AN=6;… GT:AD:DP:GQ:PL 0/0:11,0:11:0:0,0,249 0/0:10,0:10:24:0,24,360 1/1:0,18:18:60:889,60,0  After: 1 1226231 rs13306638 G A 167563.16 PASS AC=3;AF=0.500;AN=6;…PG=0,8,22;… GT:AD:DP:GQ:JL:JP:PL:PP 0/1:11,0:11:49:2:24:0,0,249:49,0,287 0/0:10,0:10:32:2:24:0,24,360:0,32,439 1/1:0,18:18:43:2:24:889,60,0:867,43,0  The original call for the child (first sample) was HomRef with GQ0. However, given that, with high confidence, one parent is HomRef and one is HomVar, we expect the child to be heterozygous at this site. After family priors are applied, the child’s genotype is corrected and its GQ is increased from 0 to 49. Based on the allele frequency from 1000 Genomes for this site, the somewhat weaker population priors favor a HomRef call (PG=0,8,22). The combined effect of family and population priors still favors a Het call for the child. The joint likelihood for this trio at this site is two, indicating that the genotype for one of the samples may have been changed. Specifically a low JL indicates that posterior genotype for at least one of the samples was not the most likely as predicted by the PLs. The joint posterior value for the trio is 24, which indicates that the GQ values based on the posteriors for all of the samples are at least 24. (See above for a more complete description of JL and JP.) ## 5. More information about priors The Genotype Refinement Pipeline uses Bayes’s Rule to combine independent data with the genotype likelihoods derived from HaplotypeCaller, producing more accurate and confident genotype posterior probabilities. Different sites will have different combinations of priors applied based on the overlap of each site with external, supporting SNP calls and on the availability of genotype calls for the samples in each trio. ### Input-derived Population Priors If the input VCF contains at least 10 samples, then population priors will be calculated based on the discovered allele count for every called variant. ### Supporting Population Priors Priors derived from supporting SNP calls can only be applied at sites where the supporting calls overlap with called variants in the input VCF. The values of these priors vary based on the called reference and alternate allele counts in the supporting VCF. Higher allele counts (for ref or alt) yield stronger priors. ### Family Priors The strongest family priors occur at sites where the called trio genotype configuration is a Mendelian violation. In such a case, each Mendelian violation configuration is penalized by a de novo mutation probability (currently 10-6). Confidence also propagates through a trio. For example, two GQ60 HomRef parents can substantially boost a low GQ HomRef child and a GQ60 HomRef child and parent can improve the GQ of the second parent. Application of family priors requires the child to be called at the site in question. If one parent has a no-call genotype, priors can still be applied, but the potential for confidence improvement is not as great as in the 3-sample case. ### Caveats Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios. ### Overview This document describes the mathematical details of the methods involved in the Genotype Refinement workflow. For an explanation of the purpose and general principles involved in this workflow, please see the main Genotype Refinement workflow article. For step-by-step instructions on how to apply this workflow to your data, please see the Genotype Refinement tutorial. ## 1. Review of Bayes’s Rule HaplotypeCaller outputs the likelihoods of observing the read data given that the genotype is actually HomRef, Het, and HomVar. To convert these quantities to the probability of the genotype given the read data, we can use Bayes’s Rule. Bayes’s Rule dictates that the probability of a parameter given observed data is equal to the likelihood of the observations given the parameter multiplied by the prior probability that the parameter takes on the value of interest, normalized by the prior times likelihood for all parameter values: $$P(\theta|Obs) = \frac{P(Obs|\theta)P(\theta)}{\sum_{\theta} P(Obs|\theta)P(\theta)}$$ In the best practices pipeline, we interpret the genotype likelihoods as probabilities by implicitly converting the genotype likelihoods to genotype probabilities using non-informative or flat priors, for which each genotype has the same prior probability. However, in the Genotype Refinement Pipeline we use independent data such as the genotypes of the other samples in the dataset, the genotypes in a “gold standard” dataset, or the genotypes of the other samples in a family to construct more informative priors and derive better posterior probability estimates. ## 2. Calculation of Population Priors Given a set of samples in addition to the sample of interest (ideally non-related, but from the same ethnic population), we can derive the prior probability of the genotype of the sample of interest by modeling the sample’s alleles as two independent draws from a pool consisting of the set of all the supplemental samples’ alleles. (This follows rather naturally from the Hardy-Weinberg assumptions.) Specifically, this prior probability will take the form of a multinomial Dirichlet distribution parameterized by the allele counts of each allele in the supplemental population. In the biallelic case the priors can be calculated as follows: $$P(GT = HomRef) = \dbinom{2}{0} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 2)}{\Gamma(nSamples + 2)\Gamma(RefCount)}$$ $$P(GT = Het) = \dbinom{2}{1} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 1)\Gamma(AltCount + 1)}{\Gamma(nSamples + 2)\Gamma(RefCount)\Gamma(AltCount)}$$ $$P(GT = HomVar) = \dbinom{2}{2} \ln \frac{\Gamma(nSamples)\Gamma(AltCount + 2)}{\Gamma(nSamples + 2)\Gamma(AltCount)}$$ where Γ is the Gamma function, an extension of the factorial function. The prior genotype probabilities based on this distribution scale intuitively with number of samples. For example, a set of 10 samples, 9 of which are HomRef yield a prior probability of another sample being HomRef with about 90% probability whereas a set of 50 samples, 49 of which are HomRef yield a 97% probability of another sample being HomRef. ## 3. Calculation of Family Priors Given a genotype configuration for a given mother, father, and child trio, we set the prior probability of that genotype configuration as follows: $$P(G_M,G_F,G_C) = P(\vec{G}) \cases{ 1-10\mu-2\mu^2 & no MV \cr \mu & 1 MV \cr \mu^2 & 2 MVs}$$ where the 10 configurations with a single Mendelian violation are penalized by the de novo mutation probability μ and the two configurations with two Mendelian violations by μ^2. The remaining configurations are considered valid and are assigned the remaining probability to sum to one. This prior is applied to the joint genotype combination of the three samples in the trio. To find the posterior for any single sample, we marginalize over the remaining two samples as shown in the example below to find the posterior probability of the child having a HomRef genotype: $$P(G_C = HomRef | \vec{D}) = \frac{L_C(G_C = HomRef) \sum_{G_F,G_M} L_F(G_F)L_M(G_M)P(\vec{G})}{\sum_{\vec{H}}P(\vec{D}|\vec{H})P(\vec{H})}$$ This quantity P(Gc|D) is calculated for each genotype, then the resulting vector is Phred-scaled and output as the Phred-scaled posterior probabilities (PPs). ## 4. Order of the workflow Family priors are calculated and applied before population priors. The opposite ordering results in overly strong population priors because they are applied to the child and parents and then compounded when the trio likelihoods are multiplied together. ### Introduction Genotype and Validate is a tool to asses the quality of a technology dataset for calling SNPs and Indels given a secondary (validation) datasource. The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you want to know how well a particular technology performs calling these snps. With a dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's dataset. Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table. ### Command-line arguments Usage of GenotypeAndValidate and its command line arguments are described here. ### The VCF Annotations The annotations can be either true positive (T) or false positive (F). 'T' means it is known to be a true SNP/Indel, while a 'F' means it is known not to be a SNP/Indel but the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F. ### The Outputs GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this: ALT REF Predictive Value called alt True Positive (TP) False Positive (FP) Positive PV called ref False Negative (FN) True Negative (TN) Negative PV The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnose. The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed. The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples). ### Additional Details • You should always use -BTI alleles, so that the GATK only looks at the sites on the VCF file, speeds up the process a lot. (this will soon be added as a default gatk engine mode) • The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF). ### Examples Genotypes BAM file from new technology using the VCF as a truth dataset: java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myNewTechReads.bam \ -alleles handAnnotatedVCF.vcf \ -BTI alleles \ -o gav.vcf  An annotated VCF example (info field clipped for clarity) #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1 1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./. 13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./. 1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.  Using a BAM file as the truth dataset: java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myTruthDataset.bam \ -alleles callsToValidate.vcf \ -BTI alleles \ -bt \ -o gav.vcf  Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker: This document describes the methods involved in variant calling as performed by the HaplotypeCaller. Please note that we are still working on producing supporting figures to help explain the sometimes complex operations involved. ### Overview The core operations performed by HaplotypeCaller can be grouped into these major steps: 1. Define active regions. The program determines which regions of the genome it needs to operate on, based on the presence of significant evidence for variation. 2. Determine haplotypes by re-assembly of the active region. For each ActiveRegion, the program builds a De Bruijn-like graph to reassemble the ActiveRegion and identifies what are the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites. 3. Determine likelihoods of the haplotypes given the read data. For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles per read for each potentially variant site. 4. Assign sample genotypes. For each potentially variant site, the program applies Bayes’ rule, using the likelihoods of alleles given the read data to calculate the posterior likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample. ### 1. Define active regions In this first step, the program traverses the sequencing data to identify regions of the genomes in which the samples being analyzed show substantial evidence of variation relative to the reference. The resulting areas are defined as “active regions”, and will be passed on to the next step. Areas that do not show any variation beyond the expected levels of background noise will be skipped in the next step. This aims to accelerate the analysis by not wasting time performing reassembly on regions that are identical to the reference anyway. To define these active regions, the program operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints. For more details on how the activity profile is computed and processed, as well as what options are available to modify the active region parameters, please see this method article. Note that the process for determining active region intervals is modified slightly when HaplotypeCaller is run in one of the special modes, e.g. the reference confidence mode (-ERC GVCF or ERC BP_RESOLUTION), Genotype Given Alleles (-gt_mode GENOTYPE_GIVEN_ALLELES) or when active regions are triggered using advanced arguments such as -allelesTrigger, --forceActive or --activeRegionIn. This is covered in the method article referenced above. Once this process is complete, the program applies a few post-processing steps to finalize the the active regions (see detailed doc above). The final output of this process is a list of intervals corresponding to the active regions which will be processed in the next step. ### 2. Determine haplotypes by re-assembly of the active region. The goal of this step is to reconstruct the possible sequences of the real physical segments of DNA present in the original sample organism. To do this, the program goes through each active region and uses the input reads that mapped to that region to construct complete sequences covering its entire length, which are called haplotypes. This process will typically generate several different possible haplotypes for each active region due to: • real diversity on polyploid (including CNV) or multi-sample data • possible allele combinations between variant sites that are not totally linked within the active region • sequencing and mapping errors In order to generate a list of possible haplotypes, the program first builds an assembly graph for the active region using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used for scoring and genotyping in the next step. The assembly and haplotype determination procedure is described in full detail in this method article. Once the haplotypes have been determined, each one is realigned against the original reference sequence in order to identify potentially variant sites. This produces the set of sites that will be processed in the next step. A subset of these sites will eventually be emitted as variant calls to the output VCF. ### 3. Evaluating the evidence for haplotypes and variant alleles Now that we have all these candidate haplotypes, we need to evaluate how much evidence there is in the data to support each one of them. So the program takes each individual read and aligns it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm, which takes into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). This outputs a score for each read-haplotype pairing, expressing the likelihood of observing that read given that haplotype. Those scores are then used to calculate out how much evidence there is for individual alleles at the candidate sites that were identified in the previous step. The process is called marginalization over alleles and produces the actual numbers that will finally be used to assign a genotype to the sample in the next step. For further details on the pairHMM output and the marginalization process, see this document. ### 4. Assigning per-sample genotypes The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted. For further details on the genotyping calculations, see this document. This concludes the overview of how HaplotypeCaller works. This document describes the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. ### Summary To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints. ### 1. Calculating the raw activity profile Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus. In the normal case (no special mode is enabled) the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment. If using the mode for genotyping given alleles (GGA) or the advanced-level flag -useAlleleTrigger, and the site is overlapped by an allele in the VCF file provided through the -alleles argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0. This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the -L argument). ### 2. Smoothing the activity profile The final profile is calculated by smoothing this initial raw profile following three steps. The first two steps consist in spreading individual position raw profile values to contiguous bases. As a result each position will have more than one raw profile value that are added up in the third and last step to obtain a final unique and smoothed value per position. 1. Unless one of the special modes is enabled (GGA or allele triggering), the position profile value will be copied over to adjacent regions if enough high quality soft-clipped bases immediately precede or follow that position in the original alignment. At time of writing, high-quality soft-clipped bases are those with quality score of Q29 or more. We consider that there are enough of such a soft-clips when the average number of high quality bases per soft-clip is 7 or more. In this case the site profile value is copied to all bases within a radius of that position as large as the average soft-clip length without exceeding a maximum of 50bp. 2. Each profile value is then divided and spread out using a Gaussian kernel covering up to 50bp radius centered at its current position with a standard deviation, or sigma, set using the -bandPassSigma argument (current default is 17 bp). The larger the sigma, the broader the spread will be. 3. For each position, the final smoothed value is calculated as the sum of all its profile values after steps 1 and 2. ### 3. Setting the ActiveRegion thresholds and intervals The resulting profile line is cut in regions where it crosses the non-active to active threshold (currently set to 0.002). Then we make some adjustments to these boundaries so that those regions that are to be considered active, with a profile running over that threshold, fall within the minimum (fixed to 50bp) and maximum region size (customizable using -activeRegionMaxSize). • If the region size falls within the limits we leave it untouched (it's good to go). • If the region size is shorter than the minimum, it is greedily extended forward ignoring that cut point and we come back to step 1. Only if this is not possible because we hit a hard-limit (end of the chromosome or requested analysis interval) we will accept the small region as it is. • If it is too long, we find the lowest local minimum between the maximum and minimum region size. A local minimum is a profile value preceded by a large one right up-stream (-1bp) and an equal or larger value down-stream (+1bp). In case of a tie, the one further downstream takes precedence. If there is no local minimum we simply force the cut so that the region has the maximum active region size. Of the resulting regions, those with a profile that runs over this threshold are considered active regions and progress to variant discovery and or calling whereas regions whose profile runs under the threshold are considered inactive regions and are discarded except if we are running HC in reference confidence mode. There is a final post-processing step to clean up and trim the ActiveRegion: • Remove bases at each end of the read (hard-clipping) until there a base with a call quality equal or greater than minimum base quality score (customizable parameter -mbq, 10 by default). • Include or exclude remaining soft-clipped ends. Soft clipped ends will be used for assembly and calling unless the user has requested their exclusion (using -dontUseSoftClippedBases), if the read and its mate map to the same chromosome, and if they are in the correct standard orientation (i.e. LR and RL). • Clip off adaptor sequences of the read if present. • Discard all reads that no longer overlap with the ActiveRegion after the trimming operations described above. • Downsample remaining reads to a maximum of 1000 reads per sample, but respecting a minimum of 5 reads starting per position. This is performed after any downsampling by the traversal itself (-dt, -dfrac, -dcov etc.) and cannot be overriden from the command line. This document details the procedure used by HaplotypeCaller to re-assemble read data and determine candidate haplotypes as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. Note that we are still working on producing figures to complement the text. We will update this document as soon as the figures are ready. Note also that this is a provisional document and some final corrections may be made for accuracy and/or completeness. Feedback is most welcome! ### Overview The previous step produced a list of ActiveRegions that showed some evidence of possible variation (see step 1 documentation). Now, we need to process each Active Region in order to generate a list of possible haplotypes based on the sequence data we have for that region. To do so, the program first builds an assembly graph for each active region (determined in the previous step) using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used to call variants and assign per-sample genotypes in the next steps. ### 1. Reference graph assembly First, we construct the reference assembly graph, which starts out as a simple directed DeBruijn graph. This involves decomposing the reference sequence into a succession of kmers (pronounced "kay-mers"), which are small sequence subunits that are k bases long. Each kmer sequence overlaps the previous kmer by k-1 bases. The resulting graph can be represented as a series of nodes and connecting edges indicating the sequential relationship between the adjacent bases. At this point, all the connecting edges have a weight of 0. In addition to the graph, we also build a hash table of unique kmers, which we use to keep track of the position of nodes in the graph. At the beginning, the hash table only contains unique kmers found in the reference sequence, but we will add to it in the next step. A note about kmer size: by default, the program will attempt to build two separate graphs, using kmers of 10 and 25 bases in size, respectively, but other kmer sizes can be specified from the command line with the -kmerSize argument. The final set of haplotypes will be selected from the union of the graphs obtained using each k. ### 2. Threading reads through the graph This is where our simple reference graph turns into a read-threading graph, so-called because we're going to take each read in turn and try to match it to a path in the graph. We start with the first read and compare its first kmer to the hash table to find if it has a match. If there is a match, we look up its position in the reference graph and record that position. If there is no match, we consider that it is a new unique kmer, so we add that unique kmer to the hash table and add a new node to the graph. In both cases, we then move on and repeat the process with the next kmer in the read until we reach the end of the read. When two consecutive kmers in a read belong to two nodes that were already connected by an edge in the graph, we increase the weight of that edge by 1. If the two nodes were not connected yet, we add a new edge to the graph with a starting weight of 1. As we repeat the process on each read in turn, edge weights will accumulate along the paths that are best supported by the read data, which will help us select the most likely paths later on. Note on graph complexity, cycles and non-unique kmers For this process to work properly, we need the graph to be sufficiently complex (where the number of non-unique k-mers is less that 4-fold the number of unique kmers found in the data) and without cycles. In certain genomic regions where there are a lot of repeated sequences, these conditions may not be met, because repeats cause cycles and diminish the number of available unique kmers. If none of the kmer sizes provided results in a viable graph (complex enough and without cycles) the program will automatically try the operation again with larger kmer sizes. Specifically, we take the largest k provided by the user (or by the default settings) and increase it by 10 bases. If no viable graph can be obtained after iterating over increased kmer sizes 6 times, we give up and skip the active region entirely. ### 3. Graph refinement Once all the reads have been threaded through the graph, we need to clean it up a little. The main cleaning-up operation is called pruning (like the gardening technique). The goal of the pruning operation is to remove noise due to errors. The basic idea is that sections of the graph that are supported by very few reads are most probably the result of stochastic errors, so we are going to remove any sections that are supported by fewer than a certain threshold number of reads. By default the threshold value is 2, but this can be controlled from the command line using the -minPruning argument. In practice, this means that linear chains in the graph (linear sequence of vertices and edges without any branching) where all edges have fewer than 2 supporting reads will be removed. Increasing the threshold value will lead to faster processing and higher specificity, but will decrease sensitivity. Decreasing this value will do the opposite, decreasing specificity but increasing sensitivity. Note that if you are calling multiple samples together, the program also looks at how many of the samples support each segment, and only prunes segments for which fewer than a certain number of samples have the minimum required number of supporting reads. By default this sample number is 1, so as long as one sample in the cohort passes the pruning threshold, the segment will NOT be pruned. This is designed to avoid losing singletons (variants that are unique to a single sample in a cohort). This parameter can also be controlled from the command line using the -minPruningSamples argument, but keep in mind that increasing the default value may lead to decreased sensitivity. At this stage, the program can also perform graph refinement operations that are specific to certain special modes. For example, when HaplotypeCaller is run on RNAseq, it will recover dangling heads and tails from the splice junctions to compensate for issues that are related to limitations in graph assembly of that data type. ### 4. Select best haplotypes Now that the graph is all cleaned up, the program builds haplotype sequences by traversing all possible paths in the graph and calculates a likelihood score for each one. This score is calculated as the product of transition probabilities of the path edges, where the transition probability of an edge is computed as the number of reads supporting that edge divided by the sum of the support of all edges that share that same source vertex. In order to limit the amount of computation needed for the next step, we limit the number of haplotypes that will be considered for each value of k (remember that the program builds graphs for multiple kmer sizes). This is easy to do since we conveniently have scores for each haplotype; all we need to do is select the N haplotypes with the best scores. By default that number is very generously set to 128 (so the program would proceed to the next step with up to 128 haplotypes per value of k) but this can be adjusted from the command line using the -maxNumHaplotypesInPopulation argument. You would mainly want to decrease this number in order to improve speed; increasing that number would rarely be reasonable, if ever. ### 5. Identify potential variation sites Once we have a list of plausible haplotypes, we perform a Smith-Waterman alignment (SWA) of each haplotype to the original reference sequence across the active region in order to reconstruct a CIGAR string for the haplotype. Note that indels will be left-aligned; that is, their start position will be set as the leftmost position possible. This finally yields the potential variation sites that will be put through the variant modeling step next, bringing us back to the "classic" variant calling methods (as used by GATK's UnifiedGenotyper and Samtools' mpileup). Note that this list of candidate sites is essentially a super-set of what will eventually be the final set of called variants. Every site that will be called variant is in the super-set, but not every site that is in the super-set will be called variant. This document describes the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. ### Overview The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample. ### 1. Evaluating the evidence for each candidate haplotype We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step. Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. (example figure TBD) The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read. ### 2. Evaluating the evidence for each candidate site and corresponding alleles Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM. This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods. (example fig TBD) At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step. This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in -ERC modes (GVCF and BP_RESOLUTION) please see the reference confidence model documentation. ### Overview The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted. ### 1. Preliminary assumptions / limitations #### Quality Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values). #### Ploidy Both the HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume that the organism of study is diploid by default, but desired ploidy can be set using the -ploidy argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation. The generalized form of the genotyping algorithm that can handle ploidies other than 2 is available as of version 3.3-0. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered (currently, max ploidy = 20 for max alt alleles = 6). Future developments will aim to mitigate those limitations. #### Paired end reads Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations. ### 2. Calculating genotype likelihoods using Bayes' Theorem We use the approach described in [Li2011] to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation. The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is: $$P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) }$$ If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left: $$P(G|D)$$ is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D. Now let's break down the term on the right: $$\frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) }$$ We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction): $$P(G) P(D|G)$$ which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype. The first one is the easiest to understand. The prior probability of the genotype G: $$P(G)$$ represents how probably we expect to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying. The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype, $$P(D|G)$$ is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula: $$P(D|G) = \prod{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right)$$ You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into: $$G = H_1H_2$$ which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes. Now, back to our calculation, what's left to figure out is this: $$P(D_j|H_n)$$ which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain: $$P(G|D)$$ for the genotype G. ### 3. Selecting a genotype and emitting the call record We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A, T, and C at a site, the possible genotypes are AA, AT, AC, TT, TC, and CC, and we end up with 6 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample. Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation. Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record. For details of how these metrics are calculated, please see the variant annotations documentation. WARNING: unfortunately we do not have the resources to directly support the HLA typer at this time. As such this tool is no longer under active development or supported by our group. The source code is available in the GATK as is. This tool may or may not work without substantial experimentation by an analyst. ### Introduction Inherited DNA sequence variation in the major histocompatibilty complex (MHC) on human chromosome 6 significantly influences the inherited risk for autoimmune diseases and the host response to pathogenic infections. Collecting allelic sequence information at the classical human leukocyte antigen (HLA) genes is critical for matching in organ transplantation and for genetic association studies, but is complicated due to the high degree of polymorphism across the MHC. Next-generation sequencing offers a cost-effective alternative to Sanger-based sequencing, which has been the standard for classical HLA typing. To bridge the gap between traditional typing and newer sequencing technologies, we developed a generic algorithm to call HLA alleles at 4-digit resolution from next-generation sequence data. ### Downloading the HLA tools The HLA-specific walkers/tools (FindClosestHLA, CalculateBaseLikelihoods, and HLACaller) are available as a separate download from our FTP site and as source code only. Instructions for obtaining and compiling them are as follows: Download the source code (in a tar ball): location: ftp://gsapubftp-anonymous@ftp.broadinstitute.org password: <blank> subdirectory: HLA/ Untar the file, 'cd' into the untar'ed directory and compile with 'ant' Remember that we no longer support this tool, so if you encounter issues with any of these steps please do NOT post them to our support forum. ### The algorithm Algorithmic components of the HLA caller: The HLA caller algorithm, developed as part of the open-source GATK, examines sequence reads aligned to the classical HLA loci taking SAM/BAM formatted files as input and calculates, for each locus, the posterior probabilities for all pairs of classical alleles based on three key considerations: (1) genotype calls at each base position, (2) phase information of nearby variants, and (3) population-specific allele frequencies. See the diagram below for a visualization of the heuristic. The output of the algorithm is a list of HLA allele pairs with the highest posterior probabilities. Functionally, the HLA caller was designed to run in three steps: 1. The "FindClosestAllele" walker detects misaligned reads by comparing each read to the dictionary of HLA alleles (reads with < 75% SNP homology to the closest matching allele are removed) 2. The "CalculateBaseLikelihoods" walker calculates the likelihoods for each genotype at each position within the HLA loci and finds the polymorphic sites in relation to the reference 3. The "HLAcaller" walker reads the output of the previous steps, and makes the likelihood / probability calculations based on base genotypes, phase information, and allele frequencies. ### Required inputs These inputs are required: • Aligned sequence (.bam) file - input data • Genomic reference (.bam) file - human genome build 36. • HLA exons (HLA.intervals) file - list of HLA loci / exons to examine. • HLA dictionary - list of HLA alleles, DNA sequences, and genomic positions. • HLA allele frequencies - allele frequencies for HLA alleles across multiple populations. • HLA polymorphic sites - list of polymorphic sites (used by FindClosestHLA walker) You can download the latter four here. ### Usage and Arguments #### Standard GATK arguments (applies to subsequent functions) The GATK contains a wealth of tools for analysis of sequencing data. Required inputs include an aligned bam file and reference fasta file. The following example shows how to calculate depth of coverage. Usage: java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I input.bam -R ref.fasta -L input.intervals > output.doc  Arguments: • -T (required) name of walker/function • -I (required) Input (.bam) file. • -R (required) Genomic reference (.fasta) file. • -L (optional) Interval or list of genomic intervals to run the genotyper on. #### FindClosestHLA The FindClosestHLA walker traverses each read and compares it to all overlapping HLA alleles (at specific polymorphic sites), and identifies the closest matching alleles. This is useful for detecting misalignments (low concordance with best-matching alleles), and helps narrow the list of candidate alleles (narrowing the search space reduces computational speed) for subsequent analysis by the HLACaller walker. Inputs include the HLA dictionary, a list of polymorphic sites in the HLA, and the exons of interest. Output is a file (output.filter) that includes the closest matching alleles and statistics for each read. Usage: java -jar GenomeAnalysisTK.jar -T FindClosestHLA -I input.bam -R ref.fasta -L HLA_EXONS.intervals -HLAdictionary HLA_DICTIONARY.txt \ -PolymorphicSites HLA_POLYMORPHIC_SITES.txt -useInterval HLA_EXONS.intervals | grep -v INFO > output.filter  Arguments: • -HLAdictionary (required) HLA_DICTIONARY.txt file • -PolymorphicSites (required) HLA_POLYMORPHIC_SITES.txt file • -useInterval (required) HLA_EXONS.intervals file #### CalculateBaseLikelihoods CalculateBestLikelihoods walker traverses each base position to determine the likelihood for each of the 10 diploid genotypes. These calculations are used later by HLACaller to determine likelihoods for HLA allele pairs based on genotypes, as well as determining the polymorphic sites used in the phasing algorithm. Inputs include aligned bam input, (optional) results from FindClosestHLA (to remove misalignments), and cutoff values for inclusion or exclusion of specific reads. Output is a file (output.baselikelihoods) that contains base likelihoods at each position. Usage: java -jar GenomeAnalysisTK.jar -T CalculateBaseLikelihoods -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter \ -maxAllowedMismatches 6 -minRequiredMatches 0 | grep -v "INFO" | grep -v "MISALIGNED" > output.baselikelihoods  Arguments: • -filter (optional) file = output of FindClosestHLA walker (output.filter - to exclude misaligned reads in genotype calculations) • -maxAllowedMismatches (optional) max number of mismatches tolerated between a read and the closest allele (default = 6) • -minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 0) #### HLACaller The HLACaller walker calculates the likelihoods for observing pairs of HLA alleles given the data based on genotype, phasing, and allele frequency information. It traverses through each read as part of the phasing algorithm to determine likelihoods based on phase information. The inputs include an aligned bam files, the outputs from FindClosestHLA and CalculateBaseLikelihoods, the HLA dictionary and allele frequencies, and optional cutoffs for excluding specific reads due to misalignment (maxAllowedMismatches and minRequiredMatches). Usage: java -jar GenomeAnalysisTK.jar -T HLACaller -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter -baselikelihoods output.baselikelihoods\ -maxAllowedMismatches 6 -minRequiredMatches 5 -HLAdictionary HLA_DICTIONARY.txt -HLAfrequencies HLA_FREQUENCIES.txt | grep -v "INFO" > output.calls  Arguments: • -baseLikelihoods (required) output of CalculateBaseLikelihoods walker (output.baselikelihoods - genotype likelihoods / list of polymorphic sites from the data) • -HLAdictionary (required) HLA_DICTIONARY.txt file • -HLAfrequencies (required) HLA_FREQUENCIES.txt file • -useInterval (required) HLA_EXONS.intervals file • -filter (optional) file = output of FindClosestAllele walker (to exclude misaligned reads in genotype calculations) • -maxAllowedMismatches (optional) max number of mismatched bases tolerated between a read and the closest allele (default = 6) • -minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 5) • -minFreq (option) minimum allele frequency required to consider the HLA allele (default = 0.0). #### Example workflow (genome-wide HiSeq data in NA12878 from HapMap; computations were performed on the Broad servers) Extract sequences from the HLA loci and make a new bam file: use Java-1.6 set HLA=/seq/NKseq/sjia/HLA_CALLER set GATK=/seq/NKseq/sjia/Sting/dist/GenomeAnalysisTK.jar set REF=/humgen/1kg/reference/human_b36_both.fasta cpHLA/samheader NA12878.HLA.sam

java -jar $GATK -T PrintReads \ -I /seq/dirseq/ftp/NA12878_exome/NA12878.bam -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta \ -L$HLA/HLA.intervals | grep -v RESULT | sed 's/chr6/6/g' >> NA12878.HLA.sam



Use FindClosestHLA to find closest matching HLA alleles and to detect possible misalignments:

java -jar $GATK -T FindClosestHLA -I NA12878.HLA.bam -R$REF -L $HLA/HLA_EXONS.intervals -useInterval$HLA/HLA_EXONS.intervals \
-HLAdictionary $HLA/HLA_DICTIONARY.txt -PolymorphicSites$HLA/HLA_POLYMORPHIC_SITES.txt | grep -v INFO > NA12878.HLA.filter

READ_NAME   START-END       S   %Match  Matches Discord Alleles
20FUKAAXX100202:7:63:8309:75917 30018423-30018523   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...
20GAVAAXX100126:3:24:13495:18608    30018441-30018541   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,...
20FUKAAXX100202:8:44:16857:92134    30018442-30018517   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,HLA_A*010105,...
20FUKAAXX100202:8:5:4309:85338  30018452-30018552   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20GAVAAXX100126:3:28:7925:160832    30018453-30018553   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20FUKAAXX100202:1:2:10539:169258    30018459-30018530   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,.
20FUKAAXX100202:8:43:18611:44456    30018460-30018560   1.0 1.000   3   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...


Use CalculateBaseLikelihoods to determine genotype likelihoods at every base position:

java -jar $GATK -T CalculateBaseLikelihoods -I NA12878.HLA.bam -R$REF -L $HLA/HLA_EXONS.intervals \ -filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 0 | grep -v INFO | grep -v MISALIGNED > NA12878.HLA.baselikelihoods chr:pos Ref Counts AA AC AG AT CC CG CT GG GT TT 6:30018513 G A[0]C[0]T[1]G[39] -113.58 -113.58 -13.80 -113.29 -113.58 -13.80 -113.29 -3.09 -13.50 -113.11 6:30018514 C A[0]C[39]T[0]G[0] -119.91 -13.00 -119.91 -119.91 -2.28 -13.00 -13.00 -119.91 -119.91 -119.91 6:30018515 T A[0]C[0]T[39]G[0] -118.21 -118.21 -118.21 -13.04 -118.21 -118.21 -13.04 -118.21 -13.04 -2.35 6:30018516 C A[0]C[38]T[1]G[0] -106.91 -13.44 -106.91 -106.77 -3.05 -13.44 -13.30 -106.91 -106.77 -106.66 6:30018517 C A[0]C[38]T[0]G[0] -103.13 -13.45 -103.13 -103.13 -3.64 -13.45 -13.45 -103.13 -103.13 -103.13 6:30018518 C A[0]C[38]T[0]G[0] -112.23 -12.93 -112.23 -112.23 -2.71 -12.93 -12.93 -112.23 -112.23 -112.23 ...  Run HLACaller using outputs from previous steps to determine the most likely alleles at each locus: java -jar$GATK -T HLACaller -I NA12878.HLA.bam -R $REF -L$HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \ -bl NA12878.HLA.baselikelihoods -filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 5 \ -HLAdictionary$HLA/HLA_DICTIONARY.txt -HLAfrequencies $HLA/HLA_FREQUENCIES.txt > NA12878.HLA.info grep -v INFO NA12878.HLA.info > NA12878.HLA.calls Locus A1 A2 Geno Phase Frq1 Frq2 L Prob Reads1 Reads2 Locus EXP White Black Asian A 0101 1101 -1229.5 -15.2 -0.82 -0.73 -1244.7 1.00 180 191 229 1.62 -1.99 -3.13 -2.07 B 0801 5601 -832.3 -37.3 -1.01 -2.15 -872.1 1.00 58 59 100 1.17 -3.31 -4.10 -3.95 C 0102 0701 -1344.8 -37.5 -0.87 -0.86 -1384.2 1.00 91 139 228 1.01 -2.35 -2.95 -2.31 DPA1 0103 0201 -842.1 -1.8 -0.12 -0.79 -846.7 1.00 72 48 120 1.00 -0.90 -INF -1.27 DPB1 0401 1401 -991.5 -18.4 -0.45 -1.55 -1010.7 1.00 64 48 113 0.99 -2.24 -3.14 -2.64 DQA1 0101 0501 -1077.5 -15.9 -0.90 -0.62 -1095.4 1.00 160 77 247 0.96 -1.53 -1.60 -1.87 DQB1 0201 0501 -709.6 -18.6 -0.77 -0.76 -729.7 0.95 50 87 137 1.00 -1.76 -1.54 -2.23 DRB1 0101 0301 -1513.8 -317.3 -1.06 -0.94 -1832.6 1.00 52 32 101 0.83 -1.99 -2.83 -2.34  Make a SAM/BAM file of the called alleles: awk '{if (NR > 1){print$1 "*" $2 "\n"$1 "*" $3}}' NA12878.HLA.calls | sort -u > NA12878.HLA.calls.unique cp$HLA/samheader NA12878.HLA.calls.sam
awk '{split($1,a,"*"); print "grep \"" a[1] "[*]" a[2] "\" '$HLA/HLA_DICTIONARY.sam' >> 'NA12878.HLA'.tmp";}' NA12878.HLA.calls.unique | sh
sort -k4 -n NA12878.HLA.tmp >> NA12878.HLA.calls.sam
rm NA12878.HLA.tmp


There exist a few performance / accuracy tradeoffs in the HLA caller, as in any algorithm. The following are a few key considerations that the user should keep in mind when using the software for HLA typing.

#### Robustness to sequencing/alignment artifact vs. Ability to recognize rare alleles

In polymorphic regions of the genome like the HLA, misaligned reads (presence of erroneous reads or lack of proper sequences) and sequencing errors (indels, systematic PCR errors) may cause the HLA caller to call rare alleles with polymorphisms at the affected bases. The user can manually spot these errors when the algorithm calls a rare allele (Frq1 and Frq2 columns in the output of HLACaller indicate log10 of the allele frequencies). Alternatively, the user can choose to consider only non-rare alleles (use the "-minFreq 0.001" option in HLACaller) to make the algorithm (faster and) more robust against sequencing or alignment errors. The drawback to this approach is that the algorithm may not be able to correctly identifying rare alleles when they are truly present. We recommend using the -minFreq option for genome-wide sequencing datasets, but not for high-quality (targeted PCR 454) data specifically captured for HLA typing in large cohorts.

#### Misalignment Detection and Data Pre-Processing

The FindClosestAllele walker (optional step) is recommended for two reasons:

1. The ability to detect misalignments for reads that don't match very well to the closest appearing HLA allele - removing these misaligned reads improves calling accuracy.

2. Creating a list of closest-matching HLA alleles reduces the search space (over 3,000 HLA alleles across the class I and class II loci) that HLACaller has to iterate through, reducing computational burdon.

However, using this pre-processing step is not without costs:

1. Any cutoff chosen for %concordance, min base matches, or max base mismatches will not distinguish between correctly aligned and misaligned reads 100% of the time - there is a chance that correctly aligned reads may be removed, and misaligned reads not removed.

2. The list of closest-matching alleles in some cases may not contain the true allele if there is sufficient sequencing error, in which case the true allele will not be considered by the HLACaller walker. In our experience, the advantages of using this pre-processing FindClosestAllele walker greatly outweighs the disadvantages, as recommend including it in the pipeline long as the user understands the possible risks of using this function.

### Contributions

The HLA caller algorithm was was developed by Xiaoming (Sherman) Jia with generous support of the GATK team (especially Mark Depristo, Eric Banks), and Paul de Bakker.

• xiaomingjia at gmail dot com
• depristo at broadinstitute dot org
• ebanks at broadinstitute dot org
• pdebakker at rics dot bwh dot harvard dot edu

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFS), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

### How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

• Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
• Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models.

We use a symbolic allele pair, <NON_REF>, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what is a gVCF.

Note: As of version 4, BEAGLE reads and outputs VCF files directly, and can handle multiallelic sites. We have not yet evaluated what this means for the GATK-BEAGLE interface; it is possible that some of the information provided below is no longer applicable as a result.

### Introduction

BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can

• phase genotype data (i.e. infer haplotypes) for unrelated individuals, parent-offspring pairs, and parent-offspring trios.
• infer sporadic missing genotype data.
• impute ungenotyped markers that have been genotyped in a reference panel.
• perform single marker and haplotypic association analysis.
• detect genetic regions that are homozygous-by-descent in an individual or identical-by-descent in pairs of individuals.

The GATK provides an experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), b) Genotype inference for unrelated individuals.

### Workflow

The basic workflow for this interface is as follows:

After variants are called and possibly filtered, the GATK walker ProduceBeagleInput will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.

• User needs to run BEAGLE with this likelihood file specified as input.
• After Beagle runs, user must unzip resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
• User can then run GATK walker BeagleOutputToVCF to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.

### Producing Beagle input likelihoods file

Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using walker ProduceBealgeInput, as described in detail in its documentation page.

For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:

marker    alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892
20:60251    T        C     10.00   1.26    0.00     9.77   2.45    0.00
20:60321    G        T     10.00   5.01    0.01    10.00   0.31    0.00
20:60467    G        C      9.55   2.40    0.00     9.55   1.20    0.00


Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).

IMPORTANT: Due to BEAGLE memory restrictions, it's strongly recommended that BEAGLE be run on a separate chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE is run and an output VCF is produced as described below, CombineVariants can be used to combine resulting VCF's, using the "-variantMergeOptions UNION" argument.

### Running Beagle

We currently only support a subset of BEAGLE functionality - only unphased, unrelated input likelihood data is supported. To run imputation analysis, run for example

java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun


Extra BEAGLE arguments can be added as required.

Empirically, Beagle can run up to about ~800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.

### Processing BEAGLE output files

BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for their being processed, and put them all in the same place:

# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
#rename also Beagle likelihood file to mantain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like


### Creating a new VCF from BEAGLE data with BeagleOutputToVCF

Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.

The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.

The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).

The resulting VCF can now be used for further downstream analysis.

### Merging VCFs broken up by chromosome into a single genome-wide file

Assuming you have broken up your calls into Beagle by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.

java -jar /path/to/dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R reffile.fasta \
--out genome_wide_output.vcf \
-V:input1 beagle_output_chr1.vcf \
-V:input2 beagle_output_chr2.vcf \
.
.
.
-V:inputX beagle_output_chrX.vcf \
-type UNION -priority input1,input2,...,inputX


This is a placeholder for a document in preparation.

In the meantime, for a quick introduction to the GATK and the purpose of the Best Practices, please see the intro talks from our latest workshop in the Presentations section.

Important notes on context and caveats

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent). In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the FAQ articles for help dealing with other experimental designs. Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

# liftOverVCF.pl

## Introduction

This script converts a VCF file from one reference build to another. It runs 3 modules within our toolkit that are necessary for lifting over a VCF.
1. LiftoverVariants walker
2. sortByRef.pl to sort the lifted-over file
3. Filter out records whose ref field no longer matches the new reference

## Obtaining the Script

The liftOverVCF.pl script is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.

## Example

./liftOverVCF.pl -vcf calls.b36.vcf \
-out calls.hg19.vcf \
-gatk /humgen/gsa-scr1/ebanks/Sting_dev
-newRef /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19
-oldRef /humgen/1kg/reference/human_b36_both


## Usage

Running the script with no arguments will show the usage:

Usage: liftOverVCF.pl
-vcf        <input vcf>
-gatk       <path to gatk trunk>
-chain      <chain file>
-newRef     <path to new reference prefix; we will need newRef.dict, .fasta, and .fasta.fai>
-oldRef     <path to old reference prefix; we will need oldRef.fasta>
-out        <output vcf>
-tmp            <temp file location; defaults to /tmp>

• The 'tmp' argument is optional. It specifies the location to write the temporary file from step 1 of the process.

## Chain files

Chain files from b36/hg18 to hg19 are located here within the Broad:

   /humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/


External users can get them off our ftp site:

   location: ftp.broadinstitute.org
path:     Liftover_Chain_Files


## Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.

## Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.

# Running the Indel Realigner only at known sites

While we advocate for using the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained and limited. We modify the example arguments from above to reflect the command-lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

 java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R /path/to/reference.fasta \
-o /path/to/output.intervals \
-known /path/to/indel_calls.vcf


The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
-jar /path/to/GenomeAnalysisTK.jar \
-I <lane-level.bam> \
-R <ref.fasta> \
-T IndelRealigner \
-targetIntervals <intervalListFromStep1Above.intervals> \
-o <realignedBam.bam> \
-known /path/to/indel_calls.vcf
--consensusDeterminationModel KNOWNS_ONLY \
-LOD 0.4


### Introduction

Three-stage procedure:

• Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.

• Take the master sites VCF and genotype each sample BAM file at these sites

• (Optionally) Merge the single sample VCFs into a master VCF file

### Creating the master set of sites: SNPs and Indels

The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:

Batch 1:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       PASS    .       GT:GQ   0/1:30
20      10001436        .       A       AGG     .       PASS    .       GT:GQ   1/1:30


Batch 2:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30


In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:

Master VCF:

20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30  [pass in both]
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30  [only in batch 1]
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [fail in both]
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [pass in 1, fail in 2, choice in unclear]
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30  [only in batch 2]
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30  [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]


These issues fall into the following categories:

• For sites present in all VCFs (20:9999996 above), the alleles agree, and each site PASS is pass, this site can obviously be considered "PASS" in the master VCF
• Some sites may be PASS in one batch, but absent in others (20:10000000 and 20:10000598), which occurs when the site is polymorphic in one batch but all samples are reference or no-called in the other batch
• Similarly, sites that are fail in all batches in which they occur can be safely filtered out, or included as failing filters in the master VCF (20:10000117)

There are two difficult situations that must be addressed by the needs of the project merging batches:

• Some sites may be PASS in some batches but FAIL in others. This might indicate that either:
• The site is actually truly polymorphic, but due to limited coverage, poor sequencing, or other issues it is flag as unreliable in some batches. In these cases, it makes sense to include the site
• The site is actually a common machine artifact, but just happened to escape standard filtering in a few batches. In these cases, you would obviously like to filter out the site
• Even more complicated, it is possible that in the PASS batches you have found a reliable allele (C/T, for example) while in others there's no alt allele but actually a low-frequency error, which is flagged as failing. Ideally, here you could filter out the failing allele from the FAIL batches, and keep the pass ones
• Some sites may have multiple segregating alleles in each batch. Such sites are often errors, but in some cases may be actual multi-allelic sites, in particular for indels.

Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.

The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:

java -jar dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R human_g1k_v37.fasta \
-V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \
--sites_only \
-minimalVCF \
-o master.vcf


producing the following VCF:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      9999996     .       A       ACT         .       PASS    set=Intersection
20      10000000        .       T       G           .   PASS    set=one
20      10000117        .       C       T           .       FAIL    set=FilteredInAll
20      10000211        .       C       T           .       PASS    set=filterIntwo-one
20      10000598        .       T       A           .       PASS    set=two
20      10001436        .       A       AGG,AGGCT       .       PASS    set=Intersection


### Genotyping your samples at these sites

Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.

As a concrete example command line, you can genotype the master.vcf file using in the bundle sample NA12878 with the following command:

java -Xmx2g -jar dist/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R bundle/b37/human_g1k_v37.fasta \
-I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \
-alleles master.vcf \
-L master.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-out_mode EMIT_ALL_SITES \
-stand_call_conf 0.0 \
-glm BOTH \
-G none \


The -L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.

The last item,-G  prevents the UG from computing annotations you don't need. This command produces something like the following output:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ACT         4576.19 .       .   GT:DP:GQ:PL     1/1:76:99:4576,229,0
20      10000000        .       T       G           0       .       .       GT:DP:GQ:PL     0/0:79:99:0,238,3093
20      10000211        .       C       T       857.79  .       .   GT:AD:DP:GQ:PL  0/1:28,27:55:99:888,0,870
20      10000598        .       T       A           1800.57 .       .   GT:AD:DP:GQ:PL  1/1:0,48:48:99:1834,144,0
20      10001436        .       A       AGG,AGGCT       1921.12 .       .   GT:DP:GQ:PL     0/2:49:84.06:1960,2065,0,2695,222,84


Several things should be noted here:

• The genotype likelihoods calculation evolves, especially for indels, the exact results of this command will change.
• The command will emit sites that are hom-ref in the sample at the site, but the -stand_call_conf 0.0 argument should be provided so that they aren't tagged as "LowQual" by the UnifiedGenotyper.
• The filtered site 10000117 in the master.vcf is not genotyped by the UG, as it doesn't pass filters and so is considered bad by the GATK UG. If you want to determine the genotypes for all sites, independent on filtering, you must unfilter all of your records in master.vcf, and if desired, restore the filter string for these records later.

This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:

foreach sample in samples:
run UnifiedGenotyper command above with -I $sample.bam -o$sample.vcf
end


### (Optional) Merging the sample VCFs together

You can use a similar command for CombineVariants above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:

java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf


### General notes

• Because the GATK uses dynamic downsampling of reads, it is possible for truly marginal calls to change likelihoods from discovery (processing the BAM incrementally) vs. genotyping (jumping into the BAM). Consequently, do not be surprised to see minor differences in the genotypes for samples from discovery and genotyping.
• More advanced users may want to consider group several samples together for genotyping. For example, 100 samples could be genotyped in 10 groups of 10 samples, resulting in only 10 VCF files. Merging the 10 VCF files may be faster (or just easier to manage) than 1000 individual VCFs.
• Sometimes, using this method, a monomorphic site within a batch will be identified as polymorphic in one or more samples within that same batch. This is because the UnifiedGenotyper applies a frequency prior to determine whether a site is likely to be monomorphic. If the site is monomorphic, it is either not output, or if EMIT_ALL_SITES is thrown, reference genotypes are output. If the site is determined to be polymorphic, genotypes are assigned greedily (as of GATK-v1.4). Calling single-sample reduces the effect of the prior, so sites which were considered monomorphic within a batch could be considered polymorphic within a sub-batch.

### Introduction

Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

### BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)

To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.

### Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.

### Problems with Variant Calling with Pacific Biosciences

• Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

• Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.

• Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.

### 1. Introduction

The GATK provides an implementation of the Per-Base Alignment Qualities (BAQ) developed by Heng Li in late 2010. See this SamTools page for more details.

### 2. Using BAQ

The BAQ algorithm is applied by the GATK engine itself, which means that all GATK walkers can potentially benefit from it. By default, BAQ is OFF, meaning that the engine will not use BAQ quality scores at all.

The GATK engine accepts the argument -baq with the following enum values:

public enum CalculationMode {
OFF,                        // don't apply a BAQ at all, the default
CALCULATE_AS_NECESSARY,     // do HMM BAQ calculation on the fly, as necessary, if there's no tag
RECALCULATE                 // do HMM BAQ calculation on the fly, regardless of whether there's a tag present
}


If you want to enable BAQ, the usual thing to do is CALCULATE_AS_NECESSARY, which will calculate BAQ values if they are not in the BQ read tag. If your reads are already tagged with BQ values, then the GATK will use those. RECALCULATE will always recalculate the BAQ, regardless of the tag, which is useful if you are experimenting with the gap open penalty (see below).

If you are really an expert, the GATK allows you to specify the BAQ gap open penalty (-baqGOP) to use in the HMM. This value should be 40 by default, a good value for whole genomes and exomes for highly sensitive calls. However, if you are analyzing exome data only, you may want to use 30, which seems to result in more specific call set. We continue to play with these values some. Some walkers, where BAQ would corrupt their analyses, forbid the use of BAQ and will throw an exception if -baq is provided.

### 3. Some example uses of the BAQ in the GATK

• For UnifiedGenotyper to get more specific SNP calls.

• For PrintReads to write out a BAM file with BAQ tagged reads

• For TableRecalibrator or IndelRealigner to write out a BAM file with BAQ tagged reads. Make sure you use -baq RECALCULATE so the engine knows to recalculate the BAQ after these tools have updated the base quality scores or the read alignments. Note that both of these tools will not use the BAQ values on input, but will write out the tags for analysis tools that will use them.

Note that some tools should not have BAQ applied to them.

This last option will be a particularly useful for people who are already doing base quality score recalibration. Suppose I have a pipeline that does:

RealignerTargetCreator
IndelRealigner

BaseRecalibrator

UnifiedGenotyper


A highly efficient BAQ extended pipeline would look like

RealignerTargetCreator
IndelRealigner // don't bother with BAQ here, since we will calculate it in table recalibrator

BaseRecalibrator
PrintReads (with --BQSR input) -baq RECALCULATE // now the reads will have a BAQ tag added.  Slows the tool down some

UnifiedGenotyper -baq CALCULATE_AS_NECESSARY // UG will use the tags from TableRecalibrate, keeping UG fast


### 4. BAQ and walker control

Walkers can control via the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the qualities scores, or by only returning the baq-capped qualities scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.

## Example and Command Line Arguments

For a complete, detailed argument reference, refer to the GATK document page here

## Introduction

The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.

The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.

The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1
chr20   332341  rs6076509   A   G   470.60  PASS    AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57    GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99
chr20   332503  rs6133033   C   G   726.23  PASS    AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75    GT:DP:GL:GQ:PQ  1|0:60:-110.83,-18.08,-149.73:99:126.93


The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. Note that for cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.

Limitations:

• ReadBackedPhasing doesn't currently support insertions, deletions, or multi-nucleotide polymorphisms.
• Input VCF files should only be for diploid organisms.

## More detailed aspects of semantics of phasing in the VCF format

• The "|" symbol is used for each sample to indicate that each of the alleles of the genotype in question derive from the same haplotype as each of the alleles of the genotype of the same sample in the previous NON-FILTERED variant record. That is, rows without FILTER=PASS are essentially ignored in the read-backed phasing (RBP) algorithm.
• Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.
• A homozygous genotype is always "appended" to the preceding haplotype. For example, any 0/0 or 1/1 record is always converted into 0|0 and 1|1.
• RBP attempts to phase a heterozygous genotype relative the preceding HETEROZYGOUS genotype for that sample. If there is sufficient read information to deduce the two haplotypes (for that sample), then the current genotype is declared phased ("/" changed to "|") and assigned a PQ that is proportional to the estimated Phred-scaled error rate. All homozygous genotypes for that sample that lie in between the two heterozygous genotypes are also assigned the same PQ value (and remain phased).
• If RBP cannot phase the heterozygous genotype, then the genotype remains with a "/", and no PQ score is assigned. This site essentially starts a new section of haplotype for this sample.

For example, consider the following records from the VCF file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1   SAMP2
chr1    1   .   A   G   99  PASS    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    2   .   A   G   99  PASS    .   GT:GL:GQ:PQ 1|1:-100,0,-100:99:60   0|1:-100,0,-100:99:50
chr1    3   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:60   0|0:-100,0,-100:99:60
chr1    4   .   A   G   99  FAIL    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    5   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:70   1|0:-100,0,-100:99:60
chr1    6   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0/1:-100,0,-100:99  1|1:-100,0,-100:99:70
chr1    7   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:80   0|1:-100,0,-100:99:70
chr1    8   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:90   0|1:-100,0,-100:99:80


The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:

1. AGAAA
2. GGGAG

And two haplotypes at positions 6-8:

1. AAA
2. GGG

And, SAMP2 has the two haplotypes at positions 1-8:

1. AAAAGGAA
2. GGAAAGGG
• Note that we have excluded the non-PASS SNP call (at chr1:4), thus assuming that both samples are homozygous reference at that site.

# What is a synthetic read?

When running reduce reads, the algorithm will find regions of low variation in the genome and compress them together. To represent this compressed region, we use a synthetic read that carries all the information necessary to downstream tools to perform likelihood calculations over the reduced data.

They are called Synthetic because they are not read by a sequencer, these reads are automatically generated by the GATK and can be extremely long. In a synthetic read, each base will represent the consensus base for that genomic location. Each base will have it's consensus quality score represented in the equivalent offset in the quality score string.

## Consensus Bases

ReduceReads has several filtering parameters for consensus regions. Consensus is created based on base qualities, mapping qualities and other adjustable parameters from the command line. All filters are described in the technical documentation of reduce reads.

## Consensus Quality Scores

The consensus quality score of a consensus base is essentially the mean of all bases that passed all the filters and represent an observation of that base. It is represented in the quality score field of the SAM format.

n is the number of bases that contributed to the consensus base and q_i is the corresponding quality score of each base.

Insertion quality scores and Deletion quality scores (generated by BQSR) will undergo the same process and will be represented the same way.

## Mapping Quality

The mapping quality of a synthetic read is a value representative of the mapping qualities of all the reads that contributed to it. This is an average of the root mean square of the mapping quality of all reads that contributed to the bases of the synthetic read. It is represented in the mapping quality score field of the SAM format.

where n is the number of reads and x_i is the mapping quality of each read.

## Original Alignments

A synthetic read may come with up to two extra tags representing its original alignment information. Due to many filters in ReduceReads, reads are hard-clipped to the are of interest. These hard-clips are always represented in the cigar string with the H element and the length of the clipping in genomic coordinates. Sometimes hard clipping will make it impossible to retrieve what was the original alignment start / end of a read. In those cases, the read will contain extra tags with integer values representing their original alignment start or end.

Here are the two integer tags:

• OP -- original alignment start
• OE -- original alignment end

For all other reads, where this can still be obtained through the cigar string (i.e. using getAlignmentStart() or getUnclippedStart()), these tags are not created.

## The RR Tag

the RR tag is a tag that holds the observed depth (after filters) of every base that contributed to a reduce read. That means all bases that passed the mapping and base quality filters, and had the same observation as the one in the reduced read.

The RR tag carries an array of bytes and for increased compression, it works like this: the first number represents the depth of the first base in the reduced read. all subsequent numbers will represent the offset depth from the first base. Therefore, to calculate the depth of base "i" using the RR array, one must use :

RR[0] + RR[i]

but make sure i > 0. Here is the code we use to return the depth of the i'th base:

return (i==0) ? firstCount : (byte) Math.min(firstCount + offsetCount, Byte.MAX_VALUE);

# Using Synthetic Reads with GATK tools

The GATK is 100% compatible with synthetic reads. You can use Reduced BAM files in combination with non-reduced BAM files in any GATK analysis tools and it will work seamlessly.

# Programming in the GATK

If you are programming using the GATK framework, the GATKSAMRecord class carries all the necessary functionality to use synthetic reads transparently with methods like:

• public final byte getReducedCount(final int i)
• public int getOriginalAlignmentStart()
• public int getOriginalAlignmentEnd()

This script can be used for sorting an input file based on a reference.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

sub usage {

print "\nUsage:\n";
print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";

print " Sorts lines of the input file INFILE according\n";
print " to the reference contig order specified by the\n";
print " reference dictionary REF_DICT (.fai file).\n";
print " The sort is stable. If -k option is not specified,\n";
print " it is assumed that the contig name is the first\n";
print " field in each line.\n\n";
print "  INPUT      input file to sort. If '-' is specified, \n";
print "             then reads from STDIN.\n";
print "  REF_DICT   .fai file, or ANY file that has contigs, in the\n";
print "             desired soting order, as its first column.\n";
print "  --k POS :  contig name is in the field POS (1-based)\n";
print "             of input lines.\n\n";

exit(1);
}

my $pos = 1; GetOptions( "k:i" => \$pos );

$pos--; usage() if ( scalar(@ARGV) == 0 ); if ( scalar(@ARGV) != 2 ) { print "Wrong number of arguments\n"; usage(); } my$input_file = $ARGV[0]; my$dict_file = $ARGV[1]; open(DICT, "<$dict_file") or die("Can not open $dict_file:$!");

my %ref_order;

my $n = 0; while ( <DICT> ) { chomp; my ($contig, $rest) = split "\t"; die("Dictionary file is probably corrupt: multiple instances of contig$contig") if ( defined $ref_order{$contig} );

$ref_order{$contig} = $n;$n++;
}

close DICT;
#we have loaded contig ordering now

my $INPUT; if ($input_file eq "-" ) {
$INPUT = "STDIN"; } else { open($INPUT, "< $input_file") or die("Can not open$input_file: $!"); } my %temp_outputs; while ( <$INPUT> ) {

my @fields = split '\s';
die("Specified field position exceeds the number of fields:\n$_") if ($pos >= scalar(@fields) );

my $contig =$fields[$pos]; if ($contig =~ m/:/ ) {
my @loc = split(/:/, $contig); # print$contig . " " . $loc[0] . "\n";$contig = $loc[0] } chomp$contig if ( $pos == scalar(@fields) - 1 ); # if last field in line my$order;
if ( defined $ref_order{$contig} ) { $order =$ref_order{$contig}; } else {$order = $n; # input line has contig that was not in the dict;$n++; # this contig will go at the end of the output,
# after all known contigs
}

my $fhandle; if ( defined$temp_outputs{$order} ) {$fhandle = $temp_outputs{$order} }
else {
#print "opening $order $$_\n"; open( fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
die ( "Can not open temporary file $order:$!");
$temp_outputs{$order} = $fhandle; } # we got the handle to the temp file that keeps all # lines with contig$contig

print $fhandle$_; # send current line to its corresponding temp file
}

close $INPUT; foreach my$f ( values %temp_outputs ) { close $f; } # now collect back into single output stream: for ( my$i = 0 ; $i <$n ; $i++ ) { # if we did not have any lines on contig$i, then there's
# no temp file and nothing to do
next if ( ! defined $temp_outputs{$i} ) ;

my $f; open ($f, "< /tmp/sortByRef.$$.i.tmp" ); while ( <f> ) { print ; } close f; unlink "/tmp/sortByRef.$$.\$i.tmp";

}


## General overview

This document describes the statistical methods used by various GATK tools, especially the variant annotations. The methods are arranged alphabetically below.

Note that this document is in progress; we will be adding more over time. Please check back later or subscribe to forum notifications to receive an email when we update the content in this section. Feel free to leave a comment saying which stats methods used by GATK you'd like to see documented here.

### List of documented methods

• Rank Sum Test

## Rank Sum Test

### Overview

The Rank Sum Test, also known as Mann-Whitney-Wilcoxon U-test after its developers (who are variously credited in subsets and in different orders depending on the sources you read) is a statistical test that aims to determine whether there is significant difference in the values of two populations of data.

### Theory

The Wikipedia article about the Rank Sum Test includes some very helpful information on the theoretical underpinnings of the test, as well as various examples of how it can be applied.

### Use in GATK

This test is used by several GATK annotations, including two standard annotations that are used for variant recalibration in the Best Practices: MappingQualityRankSum and ReadPosRankSum. In all cases, the idea is to check, for a given candidate variant, whether the properties of the data that support the reference allele are similar to those of the data that support a variant allele. If they are not similar, we conclude that there may be some technical bias and that the candidate variant may be an artifact.

### Example: BaseQualityRankSumTest

Note: this example applies Method 2 from the Wikipedia article linked above.

In this example, we have a set of 20 reads, 10 of which support the reference allele and 10 of which support the alternate allele. At first glance, that looks like a clear heterozygous 0/1 site. But to be thorough in our analysis and to account for any technical bias, we want to determine if there is a significant difference in the base qualities of the bases that support the reference allele vs. the bases that support the alternate allele.

Before we proceed, we must define our null hypothesis and alternate hypothesis.

-Null hypothesis: There is no difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

-Alternate hypothesis: There is a difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

#### Step 1: List the relevant observations

Reference allele base qualities: 20, 25, 26, 30, 32, 40, 47, 50, 53, 60 Alternate allele base qualities: 0, 7, 10, 17, 20, 21, 30, 34, 40, 45

#### Step 2: Rank the observations

First, we arrange all the observations (base qualities) into a list of values ordered from lowest to highest (reference bases are in bold).

0, 7, 10, 17, 20, 20, 21, 25, 26, 30, 30, 32, 34, 40, 40, 45, 47, 50, 53, 60

Next we determine the ranks of the values. Since there are 20 observations (the base qualities), we have 20 ranks to assign. Whenever there are ties between observations for the rank, we take the rank to be equal to the midpoint of the ranks. For example, for 20(ref) and 20(alt), we have a tie in values, so we assign each observation a rank of (5+6)/2 = 5.5.

The ranks from the above list are (reference ranks are in bold):

1, 2, 3, 4, 5.5, 5.5, 7, 8, 9, 10.5, 10.5, 12, 13, 14.5, 14.5, 16, 17, 18, 19, 20

#### Step 3: Add up the ranks for each group

We now need to add up the ranks for the base qualities that came from the reference allele and the alternate allele.

$$Rank_{ref} = 133.5$$ $$Rank_{alt} = 76.5$$

#### Step 4: Calculate U for each group

U is a statistic that tells us the difference between the two rank totals. We can use the U statistic to calculate the z-score (explained below), which will give us our p-value.

Calculate U for each group (n = number of observations in each sample)

$$U_{ref} = \frac{ n_{ref} * n_{alt} + n_{ref} * (n_{ref}+ 1) }{ 2 } - Rank_{ref}$$ $$U_{alt} = \frac{ n_{alt} * n_{ref} + n_{alt} * (n_{alt} + 1) }{ 2 } - Rank_{alt}$$

$$U_{ref} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 133.5 = 21.5$$ $$U_{alt} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 76.5 = 78.5$$

#### Step 5: Calculate the overall z-score

Next, we need to calculate the z-score which will allow us to get the p-value. The z-score is a normalized score that allows us to compare the probability of the U score occurring in our distribution. https://statistics.laerd.com/statistical-guides/standard-score.php

The equation to get the z-score is:

$$z = \frac{U - mu}{u}$$

Breaking this equation down:

$$z = z-score$$ $$U = \text{lowest of the U scores calculated in previous steps}$$

$$mu = \text{mean of the U scores above} = \frac{ n_{ref} * n_{alt} }{ 2 }$$

$$u = \text{standard deviation of U} = \sqrt{ \frac{n_{ref} * n_{alt} * (n_{ref} + n_{alt} + 1) }{ 12 } }$$

To calculate our z:

$$U = 21.5$$ $$mu = \frac{10 * 10 }{ 2 } = 50$$ $$u = \sqrt{ \frac{10 * 10 *(10 + 10 + 1) }{ 12 } } = 13.229$$

So altogether we have:

$$z = \frac{ 21.5 - 50 }{ 13.229 } = -2.154$$

#### Step 6: Calculate and interpret the p-value

The p-value is the probability of obtaining a z-score at least as extreme as the one we got, assuming the null hypothesis is true. In our example, the p-value gives us the probability that there is no difference in the base qualities that support the reference allele and the base qualities that support the alternate allele. The lower the p-value, the less likely it is that there is no difference in the base qualities.

Going to the z-score table, or just using a p-value calculator, we find the p-value to be 0.0312.

This means there is a .0312 chance that the base quality scores of the reference allele and alternate allele are the same. Assuming a p-value cutoff of 0.05, meaning there is less than 5% chance there is no difference in the two groups, and greater than or equal to 95% chance that there is a difference between the two groups, we have enough evidence to reject our null hypothesis that there is no difference in the base qualities of the reference and alternate allele. This indicates there is some bias and that the alternate allele is less well supported by the data than the allele counts suggest.

This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest)

For a complete, detailed argument reference, refer to the GATK document page here.

### 2. Logic for merging records across VCFs

CombineVariants will include a record at every site in all of your input VCF files, and annotate which input ROD bindings the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be exacted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

### 3. Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.

### 4. Understanding the set attribute

The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

• set=Intersection : occurred in both call sets, not filtered out

• set=NAME : occurred in the call set NAME only

• set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2

• set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

### 5. Changing the set key

You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.

### 6. Minimal VCF output

Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO

An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

### 7. Combining Variant Calls with a minimum set of input sites

Add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. Useful, for example in combining several data sets where we only want to keep sites present in for example at least 2 of them (in which case -minN 2 should be added to the command line).

### 8. Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf


This results in two vcf files, which look like:

==> union.vcf <==
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

==> intersect.vcf <==
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection


#### See this announcement regarding our plans for support of DepthOfCoverage and DiagnoseTargets. If you find that there are functionalities missing in either tool, leave us a comment and we will consider adding them.

For a complete, detailed argument reference, refer to the GATK document page here.

### Introduction

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

• Total, mean, median, and quartiles for each partition type: aggregate
• Total, mean, median, and quartiles for each partition type: for each interval
• A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
• A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
• A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
• A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.

For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

### Coverage by Gene

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt


The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,


If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt


If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     &gt;500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290


Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.

### 1. About the RefSeq Format

From the NCBI RefSeq website

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.


### 2. In the GATK

The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file format flavors of ReqSeq; we've chosen to use the table dump available from the UCSC genome table browser.

### 3. Generating RefSeq files

Go to the UCSC genome table browser. There are many output options, here are the changes that you'll need to make:

clade:    Mammal
genome:   Human
assembly: ''choose the appropriate assembly for the reference you're using''
group:    Genes abd Gene Prediction Tracks
track:    RefSeq Genes
table:    refGene
region:   ''choose the genome option''


Choose a good output filename, something like geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which will not be sorted, and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order. This can be done with a combination of grep, sort, and a script called sortByRef.pl that is available here.

### 4. Running with the GATK

You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:

-[arg]:REFSEQ /path/to/refSeq


Using the filename from above.

#### Warning:

The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromsome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)


All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.

Introduction

SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria listed in the command line options below. The output VCF wiil have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.

Select Variants operates on VCF files (ROD Tracks) provided in the command line using the GATK's built in --variant option. You can provide multiple tracks for Select Variants but at least one must be named 'variant' and this will be the file all your analysis will be based of. Other tracks can be named as you please. Options requiring a reference to a ROD track name will use the track name provided in the -B option to refer to the correct VCF file (e.g. --discordance / --concordance ). All other analysis will be done in the 'variant' track.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented here in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.

### Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

### How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50


In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30


The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.


with AN=2, AC=1, AF=0.5, and DP=20.

### Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942


Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

• isVariant => is there an ALT allele?

• isPolymorphic => is some sample non-ref in the samples?

In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942


If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

### Known issues

Some VCFs may have repeated header entries with the same key name, for instance:

##fileformat=VCFv3.3
##FILTER=ABFilter,&quot;AB &gt; 0.75&quot;
##FILTER=HRunFilter,&quot;HRun &gt; 3.0&quot;
##FILTER=QDFilter,&quot;QD &lt; 5.0&quot;
##UG_bam_file_used=file1.bam
##UG_bam_file_used=file2.bam
##UG_bam_file_used=file3.bam
##UG_bam_file_used=file4.bam
##UG_bam_file_used=file5.bam
##source=UnifiedGenotyper
##source=VariantFiltration
##source=AnnotateVCFwithMAF
...


Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.

For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.

2 SNPs with significant strand bias

Several SNPs with excessive coverage

For a complete, detailed argument reference, refer to the GATK document page here.

### Introduction

In addition to true variation, variant callers emit a number of false-positives. Some of these false-positives can be detected and rejected by various statistical tests. VariantAnnotator provides a way of annotating variant calls as preparation for executing these tests.

Description of the haplotype score annotation

### Examples of Available Annotations

The list below is not comprehensive. Please use the --list argument to get a list of all possible annotations available. Also, see the FAQ article on understanding the Unified Genotyper's VCF files for a description of some of the more standard annotations.

Note that technically the VariantAnnotator does not require reads (from a BAM file) to run; if no reads are provided, only those Annotations which don't use reads (e.g. Chromosome Counts) will be added. But most Annotations do require reads. When running the tool we recommend that you add the -L argument with the variant rod to your command line for efficiency and speed.

## VariantFiltration

For a complete, detailed argument reference, refer to the GATK document page here.

The documentation for Using JEXL expressions within the GATK contains very important information about limitations of the filtering that can be done; in particular please note the section on working with complex expressions.

## Filtering Individual Genotypes

One can now filter individual samples/genotypes in a VCF based on information from the FORMAT field: Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, one can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods so that one can now filter out hets (isHet == 1), refs (isHomRef == 1), or homs (isHomVar == 1).

For a complete, detailed argument reference, refer to the technical documentation page.

## Modules

You can find detailed information about the various modules here.

### Stratification modules

• AlleleFrequency
• AlleleCount
• CompRod
• Contig
• CpG
• Degeneracy
• EvalRod
• Filter
• FunctionalClass
• JexlExpression
• Novelty
• Sample

### Evaluation modules

• CompOverlap
• CountVariants

Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).

## A useful analysis using VariantEval

We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.

#### Combine the VCFs

java -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T CombineVariants \
-V:FOO foo.vcf \
-V:BAR bar.vcf \
-priority FOO,BAR \
-o merged.vcf


#### Run VariantEval

java -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R ref.fasta \
-D dbsnp.vcf \
-select 'set=="Intersection"' -selectName Intersection \
-select 'set=="FOO"' -selectName FOO \
-select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \
-select 'set=="BAR"' -selectName BAR \
-select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \
-select 'set=="FilteredInAll"' -selectName FilteredInAll \
-o merged.eval.gatkreport \
-eval merged.vcf \
-l INFO


### Checking the possible values of 'set'

It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:

java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt
grep -v 'set' fields.txt | sort | uniq -c


This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.

## Reading the VariantEval output file

The VariantEval output is formatted as a GATKReport.

### Understanding Genotype Concordance values from Variant Eval

The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.

##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats&#160;: the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats  CompRod   CpG      EvalRod  JexlExpression  Novelty  percent_comp_ref_called_var  percent_comp_het_called_het  percent_comp_het_called_var  percent_comp_hom_called_hom  percent_comp_hom_called_var  percent_non-reference_sensitivity  percent_overall_genotype_concordance  percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats  compOMNI  all      eval     none            all      0.78                         97.65                        98.39                        99.13                        99.44                        98.80                              99.09                                 3.60


The key outputs:

• percent_overall_genotype_concordance
• percent_non_ref_sensitivity_rate
• percent_non_ref_discrepancy_rate

All defined below.

### Overview

This document covers the use of various tools and metrics associated with depth of coverage.

### Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

• DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
• Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is a an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.

Note that because the AD includes reads and bases that were filtered by the caller (and in case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determines the genotype calls.

TO BE CONTINUED... meanwhile, check out the DepthOfCoverage and DiagnoseTargets tool docs.

Note that the Somatic Indel Detector was previously called Indel Genotyper V2.0

For a complete, detailed argument reference, refer to the GATK document page here.

### Calling strategy

The Somatic Indel Detector can be run in two modes: single sample and paired sample. In the former mode, exactly one input bam file should be given, and indels in that sample are called. In the paired mode, the calls are made in the tumor sample, but in addition to that the differential signal is sought between the two samples (e.g. somatic indels present in tumor cell DNA but not in the normal tissue DNA). In the paired mode, the genotyper makes an initial call in the tumor sample in the same way as it would in the single sample mode; the call, however, is then compared to the normal sample. If any evidence (even very weak, so that it would not trigger a call in single sample mode) for the event is found in the normal, the indel is annotated as germline. Only when the minimum required coverage in the normal sample is achieved and there is no evidence in the normal sample for the event called in the tumor is the indel annotated as somatic.

The calls in both modes (recall that in paired mode the calls are made in tumor sample only and are simply annotated according to the evidence in the matching normal) are performed based on a set of simple thresholds. Namely, all distinct events (indels) at the given site are collected, along with the respective counts of alignments (reads) supporting them. The putative call is the majority vote consensus (i.e. the indel that has the largest count of reads supporting it). This call is accepted if 1) there is enough coverage (as well as enough coverage in matching normal sample in paired mode); 2) reads supporting the consensus indel event constitute a sufficiently large fraction of the total coverage at the site; 3) reads supporting the consensus indel event constitute a sufficiently large fraction of all the reads supporting any indel at the site. See details in the Arguments section of the tool documentation.

Theoretically, the Somatic Indel Detector can be run directly on the aligned short read sequencing data. However, it does not perform any deep algorithmic tasks such as searching for misplaced indels close to a given one, or correcting read misalignments given the presence of an indel in another read, etc. Instead, it assumes that all the evidence for indels (all the reads that support it), for the presence of the matching event in normal etc is already in the input and performs simple counting. It is thus highly, HIGHLY recommended to run the Somatic Indel Detector on "cleaned" bam files, after performing Local realignment around indels.

### Output

Brief output file (specified with -bed option) will look as follows:

chr1    556817  556817  +G:3/7
chr1    3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21
chr1    3778838 3778838 +A:15/48
...


This is a .bed track that can be loaded into UCSC browser or IGV browser, the event itself and the <count of supporting reads>/<total coverage> are reported in the 'name' field of the file. The event locations on the chromosomes are 1-based, and the convention is that all events (both insertions and deletions) are assigned to the base on the reference immediately preceding the event (second column). The third column is the stop position of the event on the reference, or strictly speaking the base immediately preceding the first base on the reference after the event: the last deleted base for deletions, or the same base as the start position for insertions. For instance, the first line in the above example specifies an insertion (+G) supported by 3 reads out of 7 (i.e. total coverage at the site is 7x) that occurs immediately after genomic position chr1:556817. The next line specifies a 19 bp deletion -TTCTGGGAGCTCCTCCCCC supported by 9 reads (total coverage 21x) occuring at (after) chr1:3535035 (the first and last deleted bases are 3535035+1=3535036 and 3535054, respectively).

Note that in the paired mode all calls made in tumor (both germline and somatic) will be printed into the brief output without further annotations.

The detailed (verbose) output option is kept for backward compatibility with post-processing tools that might have been developed to work with older versions of the IndelGenotyperV2. All the information described below is now also recorded into the vcf output file, so the verbose text output is completely redundant, except for genomic annotations (if --refseq is used). Generated vcf file can be annotated separately using VCF post-processing tools.

The detailed output (-verbose) will contain additional statistics characterizing the alignments around each called event, SOMATIC/GERMLINE annotations (in paired mode), as well as genomic annotations (when --refseq is used). The verbose output lines matching the three lines from the example above could look like this (note that the long lines are wrapped here, the actual output file contains one line per event):

chr1    556817  556817  +G      N_OBS_COUNTS[C/A/T]:0/0/52      N_AV_MM[C/R]:0.00/5.27  N_AV_MAPQ[C/R]:0.00/35.17    \
N_NQS_MM_RATE[C/R]:0.00/0.08    N_NQS_AV_QUAL[C/R]:0.00/23.74   N_STRAND_COUNTS[C/C/R/R]:0/0/32/20  \
T_OBS_COUNTS[C/A/T]:3/3/7       T_AV_MM[C/R]:2.33/5.50  T_AV_MAPQ[C/R]:66.00/24.75   \
T_NQS_MM_RATE[C/R]:0.05/0.08    T_NQS_AV_QUAL[C/R]:20.26/11.61  T_STRAND_COUNTS[C/C/R/R]:3/0/2/2 \
SOMATIC GENOMIC
chr1    3535035 3535054 -TTCTGGGAGCTCCTCCCCC  N_OBS_COUNTS[C/A/T]:3/3/6 N_AV_MM[C/R]:3.33/2.67  N_AV_MAPQ[C/R]:73.33/99.00 \
N_NQS_MM_RATE[C/R]:0.00/0.00    N_NQS_AV_QUAL[C/R]:29.27/31.83  N_STRAND_COUNTS[C/C/R/R]:0/3/0/3  \
T_OBS_COUNTS[C/A/T]:9/9/21      T_AV_MM[C/R]:1.56/0.17  T_AV_MAPQ[C/R]:88.00/99.00    \
T_NQS_MM_RATE[C/R]:0.02/0.00   T_NQS_AV_QUAL[C/R]:30.86/25.25  T_STRAND_COUNTS[C/C/R/R]:2/7/2/10  \
GERMLINE  UTR TPRG1L
chr1    3778838 3778838 +A      N_OBS_COUNTS[C/A/T]:5/7/22      N_AV_MM[C/R]:5.00/5.20  N_AV_MAPQ[C/R]:54.20/81.20  \
N_NQS_MM_RATE[C/R]:0.00/0.01    N_NQS_AV_QUAL[C/R]:24.94/26.05  N_STRAND_COUNTS[C/C/R/R]:4/1/15/0  \
T_OBS_COUNTS[C/A/T]:15/15/48    T_AV_MM[C/R]:9.73/4.21  T_AV_MAPQ[C/R]:91.53/86.09  \
T_NQS_MM_RATE[C/R]:0.17/0.02    T_NQS_AV_QUAL[C/R]:30.57/25.19  T_STRAND_COUNTS[C/C/R/R]:15/0/32/1 \
GERMLINE   INTRON  DFFB


The fields are tab-separated. The first four fields confer the same event and location information as in the brief format (chromosome, last reference base before the event, last reference base of the event, event itself). Event information is followed by tagged fields reporting various collected statistics. In the paired mode (as in the example shown above), there will be two sets of the same statistics, one for normal (prefixed with 'N_') and one for tumor (prefixed with 'T_') samples. In the single sample mode, there will be only one set of statistics (for the only sample analyzed) and no 'N_'/'T_' prefixes. Statistics are stratified into (two or more of) the following classes: (C)onsensus-supporting reads (i.e. the reads that contain the called event, for which the line is printed); (A)ll reads that contain an indel at the site (not necessarily the called consensus); (R)eference allele-supporting reads, (T)otal=all reads.

For instance, the field T_OBS_COUNTS[C/A/T]:3/3/7 in the first line of the example above should be interpreted as follows: a) this is the OBS_COUNTS statistics for the (T)umor sample (this particular one is simply the read counts, all statistics are listed below); b) The statistics is broken down into three classes: [C/A/T]=(C)onsensus/(A)ll-indel/(T)otal coverage; c) the respective values in each class are 3, 3, 7. In other words, the insertion +G is observed in 3 distinct reads, there was a total of 3 reads with an indel at the site (i.e. only consensus was observed in this case with no observations for any other indel event), and the total coverage at the site is 7. Examining the N_OBS_COUNTS field in the same record, we can conclude that the total coverage in normal at the same site was 52, and among those reads there was not a single one carrying any indel (C/A/T=0/0/52). Hence the 'SOMATIC' annotation added towards the end of the line.

In paired mode the tagged statistics fields are always followed by GERMLINE/SOMATIC annotation (in single sample mode this field is skipped). If --refseq option is used, the next field will contain the coding status annotation (one of GENOMIC/INTRON/UTR/CODING), optionally followed by the gene name (present if the indel is within the boundaries of an annotated gene, i.e. the status is not GENOMIC).

#### List of annotations produced in verbose mode

NOTE: in older versions the OBS_COUNTS statistics was erroneously annotated as [C/A/R] (last class R, not T). This was a typo, and the last number reported in the triplet was still total coverage.

Duplicated reads, reads with mapping quality 0, or reads coming from blacklisted lanes are not counted and do not contribute to any of the statistics.

When no reads are available in a class (e.g. the count of consensus indel-supporting reads in normal sample is 0), all the other statistics for that class (e.g. average mismatches per read, average base qualities in NQS window etc) will be set to 0. For some statistics (average number of mismatches) this artificial value can be "very good", for some others (average base quality) it's "very bad". Needless to say, all those zeroes reported for the classes with no reads should be ignored when attempting call filtering.

• OBS_COUNTS[C/A/T] Observed counts of reads supporting the consensus (called) indel, all indels (consensus + any others), and the total coverage at the site, respectively.
• AV_MM[C/R] Average numbers of mismatches across consensus indel- and reference allele-supporting reads.
• AV_MAPQ[C/R] Average mapping qualities (as reported in the input bam file) of consensus indel- and reference allele-supporting reads.
• NQS_AV_QUAL[C/R] Average base quality computed across all bases falling into the 5bp window on each side of the indel and coming form all consensus- or reference-supporting reads, respectively.
• STRAND_COUNTS[C/C/R/R] Counts of consensus-supporting forward aligned, consensus-supporting rc-aligned, reference-supporting forward-aligned and reference-supporting rc-aligned reads, respectively.

### Creating a indel mask file

The output of the Somatic Indel Detector can be used to mask out SNPs near indels. To do this, we have a script that creates a bed file representing the masking intervals based on the output of this tool. Note that this script requires a full SVN checkout of the GATK, although the strategy is simple: for each indel, create an interval which extends N bases to either side of it.

python python/makeIndelMask.py <raw_indels> <mask_window> <output>
e.g.


For a complete, detailed argument reference, refer to the technical documentation page.

### 1. Slides

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see [Preparing the essential GATK input files: the reference genome] for more information on preparing FASTA reference sequences for use with the GATK.

### 2. Relatively Recent Changes

The Unified Genotyper now makes multi-allelic variant calls!

#### Fragment-based calling

The Unified Genotyper calls SNPs via a two-stage inference, first from the reads to the sequenced fragments, and then from these inferred fragments to the chromosomal sequence of the organism. This two-stage system properly handles the correlation of errors between read pairs when the sequenced fragments contains errors itself. See Fragment-based calling PDF for more details and analysis.

#### The Allele Frequency Calculation

The allele frequency calculation model used by the Unified Genotyper computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools' mpileup tool. Heng Li has graciously allowed us to post the mathematical calculations backing the EXACT model here. Note that the calculations in the provided document assume just a single alternate allele for simplicity, whereas the Unified Genotyper has been extended to handle genotyping multi-allelic events. A slide showing the mathematical details for multi-allelic calling is available here.

### 3. Indel Calling with the Unified Genotyper

While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).

Note also that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.

Finally, while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example, --min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.

### 4. Miscellaneous notes

Note that the Unified Genotyper will not call indels in 454 data!

It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console.

You can turn off logging completely by setting -l OFF so that the GATK operates in silent mode.

By default the Unified Genotyper downsamples each sample's coverage to no more than 250x (so there will be at most 250 * number_of_samples reads at a site). Unless there is a good reason for wanting to change this value, we suggest using this default value especially for exome processing; allowing too much coverage will require a lot more memory to run. When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this value to about 10 times the average coverage: -dcov 40.

The Unified Genotyper does not use reads with a mapping quality of 255 ("unknown quality" according to the SAM specification). This filtering is enforced because the genotyper caps a base's quality by the mapping quality of its read (since the probability of the base's being correct depends on both qualities). We rely on sensible values for the mapping quality and therefore using reads with a 255 mapping quality is dangerous.

• That being said, if you are working with a data type where alignment quality cannot be determined, there is a (completely unsupported) workaround: the ReassignMappingQuality filter enables you to reassign the mapping quality of all reads on the fly. For example, adding -rf ReassignMappingQuality -DMQ 60 to your command-line would change all mapping qualities in your bam to 60.

• Or, if you are working with data from a program like TopHat which uses MAPQ 255 to convey meaningful information, you can use the ReassignOneMappingQuality filter (new in 2.4) to assign a different MAPQ value to those reads so they won't be ignored by GATK tools. For example, adding -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 would change the mapping qualities of reads with MAPQ 255 in your bam to MAPQ 60.

### 5. Explanation of callable base counts

At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:

INFO  00:23:29,795 UnifiedGenotyper - Visited bases
247249719
INFO  00:23:29,796 UnifiedGenotyper - Callable bases
219998386
INFO  00:23:29,796 UnifiedGenotyper - Confidently called bases
219936125
INFO  00:23:29,796 UnifiedGenotyper - % callable bases of all loci
88.978
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci
88.953
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci
88.953
INFO  00:23:29,797 UnifiedGenotyper - Actual calls made
303126


This is what these lines mean:

• Visited bases

This the total number of reference bases that were visited.

• Callable bases

Visited bases minus reference Ns and places with no coverage, which we never try to call.

• Confidently called bases

Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and/or QUAL > T for the site being non-reference in at least one sample.

Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.

Note also that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.

### 6. Calling sex chromosomes

The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in the GSA Public Drop Box.

Our general approach to calling on X and Y is to treat them just as we do the autosomes and then applying a gender-aware tools to correct the genotypes afterwards. It makes sense to filter out sites across all samples (outside PAR) that appear as confidently het in males, as well as sites on Y that appear confidently non-reference in females. Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options. We applied this approach in 1000G, but we only did it as the data went into imputation, so there's no simple tool to do this, unfortunately. The GATK team is quite interested in a general sex correction tool (analogous to the PhaseByTransmission tool for trios), so please do contact us if you are interested in contributing such a tool to the GATK codebase.

### 7. Related materials

• Explanation of the VCF Output

This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.

### Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator contrastively evaluates variants in a two step process, each performed by a distinct tool:

• VariantRecalibrator
Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.

• ApplyRecalibration
Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

#### How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.

#### How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.

## Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

### Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (from the bottom, with the highest value of truth sensitivity but usually the lowest values of novel Ti/Tv) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels.

### Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

• The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
• The truth sensitivity (TS) approach is conceptual cleaner than deciding on a novel Ti/Tv target for your dataset
• The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

### Finally, a couple of Frequently Asked Questions

#### - Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians` is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

#### - Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well.