# IndelRealigner

Performs local realignment of reads to correct misalignments due to the presence of indels.

## Overview

The local realignment tool consumes one or more BAM files and locally realigns reads so that the number of mismatching bases is minimized across all the reads. In general, a large percentage of regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken for SNPs. Moreover, since read mapping algorithms operate on each read independently, they cannot place reads on the reference genome such that mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads that cover the indel only near their start or end are often incorrectly mapped with respect to the true indel, and these also require realignment.

Local realignment transforms regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches. Unlike most mappers, this walker uses the full alignment context to determine whether an appropriate alternate reference (i.e. indel) exists. Following local realignment, the GATK tool UnifiedGenotyper can be used to sensitively and specifically identify indels.
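As a toy illustration of the idea (not the tool's actual algorithm), the sketch below counts mismatches for two reads laid ungapped on a reference, and then on an alternate consensus carrying a 2 bp deletion. The sequences, positions and the `count_mismatches` helper are invented for this example; base qualities and CIGAR handling are ignored.

```python
def count_mismatches(read, template, start):
    """Count mismatching bases when `read` is laid on `template` at `start` without gaps."""
    window = template[start:start + len(read)]
    return sum(1 for r, t in zip(read, window) if r != t)


# Hypothetical 20 bp reference.
reference = "GATTACAGGCCTTAAGCTGA"
# Hypothetical sample haplotype: the reference with "CC" (0-based positions 9-10) deleted.
consensus = reference[:9] + reference[11:]

# (read sequence, 0-based start). Both reads come from the sample haplotype and
# start upstream of the deletion, so the same start applies to both templates.
reads = [("TACAGGTTA", 3), ("CAGGTTAAG", 5)]

against_reference = sum(count_mismatches(seq, reference, pos) for seq, pos in reads)
against_consensus = sum(count_mismatches(seq, consensus, pos) for seq, pos in reads)

print(against_reference, against_consensus)  # 8 0
```

Placed without a gap, the bases downstream of the deletion look like a run of SNPs. Against the consensus indel, the same reads match perfectly.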

There are 2 steps to the realignment process:
1. Determining (small) suspicious intervals which are likely in need of realignment (see the RealignerTargetCreator tool)
2. Running the realigner over those intervals (IndelRealigner)
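The two steps can be chained from a script. The following is a minimal sketch that only builds the two command lines. The function name `build_realignment_commands` is hypothetical, and the RealignerTargetCreator arguments (`-R`, `-I`, `-o`, `-known`) are assumed from that tool's own documentation rather than from this page.

```python
import shlex


def build_realignment_commands(ref, bam, intervals, out_bam, known=None,
                               jar="GenomeAnalysisTK.jar"):
    """Return the two GATK command lines (as argv lists) for local realignment."""
    base = ["java", "-Xmx4g", "-jar", jar]
    known_args = []
    for vcf in (known or []):
        known_args += ["-known", vcf]
    # Step 1: determine the suspicious intervals.
    target_creator = base + ["-T", "RealignerTargetCreator",
                             "-R", ref, "-I", bam, "-o", intervals] + known_args
    # Step 2: realign over those intervals, with the same bam, reference and known indels.
    realigner = base + ["-T", "IndelRealigner",
                        "-R", ref, "-I", bam,
                        "-targetIntervals", intervals,
                        "-o", out_bam] + known_args
    return [target_creator, realigner]


for cmd in build_realignment_commands("ref.fasta", "input.bam",
                                      "intervalListFromRTC.intervals",
                                      "realignedBam.bam",
                                      known=["/path/to/indels.vcf"]):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Because both commands are built from the same arguments, the inputs stay identical across the two steps.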

### Input

One or more aligned BAM files and optionally one or more lists of known indels.

### Output

A realigned version of your input BAM file(s).

### Example

    java -Xmx4g -jar GenomeAnalysisTK.jar \
      -T IndelRealigner \
      -R ref.fasta \
      -I input.bam \
      -targetIntervals intervalListFromRTC.intervals \
      -o realignedBam.bam \
      [-known /path/to/indels.vcf] \
      [-compress 0]

The -compress 0 argument is recommended to speed up the process *if* this is only a temporary file; otherwise, use the default value.


### Caveats

• An important note: the input bam(s), reference, and known indel file(s) should be the same ones used for the RealignerTargetCreator step.
• Another important note: because reads produced from the 454 technology inherently contain false indels, the realigner will not currently work with them (or with reads from similar technologies).

This Read Filter is automatically applied to the data by the Engine before processing by IndelRealigner.

### Downsampling settings

This tool does not apply any downsampling by default.

## Command-line Arguments

### Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

### IndelRealigner specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

| Argument name(s) | Default value | Summary |
|---|---|---|
| **Required Inputs** | | |
| --targetIntervals | NA | Intervals file output from RealignerTargetCreator |
| **Optional Inputs** | | |
| --knownAlleles, -known | [] | Input VCF file(s) with known indels |
| **Optional Outputs** | | |
| --out, -o | NA | Output bam |
| **Optional Parameters** | | |
| --consensusDeterminationModel, -model | USE_READS | Determines how to compute the possible alternate consensuses |
| --LODThresholdForCleaning, -LOD | 5.0 | LOD threshold above which the cleaner will clean |
| --nWayOut | NA | Generate one output file for each input (-I) bam file (not compatible with -output) |
| --entropyThreshold, -entropy | 0.15 | Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0) |
| --maxConsensuses | 30 | Max alternate consensuses to try (necessary to improve performance in deep coverage) |
| --maxIsizeForMovement, -maxIsize | 3000 | Maximum insert size of read pairs that we attempt to realign |
| --maxPositionalMoveAllowed, -maxPosMove | 200 | Maximum positional move in basepairs that a read can be adjusted during realignment |
| -greedy | 120 | Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage) |
| | 20000 | Max reads allowed at an interval for realignment |
| -maxInMemory | 150000 | Max reads allowed to be kept in memory at a time by the SAMFileWriter |
| --noOriginalAlignmentTags, -noTags | false | Don't output the original cigar or alignment start tags for each realigned read in the output bam |

### Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

### --consensusDeterminationModel / -model

Determines how to compute the possible alternate consensuses
We recommend that users run with USE_READS when trying to realign high quality longer read data mapped with a gapped aligner; Smith-Waterman is really only necessary when using an ungapped aligner (e.g. MAQ in the case of single-end read data).

The --consensusDeterminationModel argument is an enumerated type (ConsensusDeterminationModel), which can have one of the following values:

KNOWNS_ONLY
Uses only indels from a provided ROD of known indels.
USE_READS
Additionally uses indels already present in the original alignments of the reads.
USE_SW
Additionally uses 'Smith-Waterman' to generate alternate consensuses.

### --entropyThreshold / -entropy

Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0)
For expert users only! This is similar to the argument in the RealignerTargetCreator walker. The point here is that the realigner will only proceed with the realignment (even above the given threshold) if it minimizes entropy among the reads (and doesn't simply push the mismatch column to another position). This parameter is just a heuristic and should be adjusted based on your particular data set.

double  0.15  [ [ -∞  ∞ ] ]
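To make the heuristic concrete, here is a simplified sketch that flags columns whose mismatch fraction exceeds the threshold and checks whether a candidate realignment reduces the number of such columns. It counts bases rather than weighting by base quality, and the helper names, sequences and acceptance rule are assumptions for illustration, not the tool's implementation.

```python
def mismatch_fraction_by_column(reads, template):
    """Map each covered position to the fraction of read bases mismatching the template."""
    depth = {}
    mismatches = {}
    for seq, start in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(template):
                break
            depth[pos] = depth.get(pos, 0) + 1
            if base != template[pos]:
                mismatches[pos] = mismatches.get(pos, 0) + 1
    return {pos: mismatches.get(pos, 0) / depth[pos] for pos in depth}


def high_entropy_columns(reads, template, threshold=0.15):
    """Return the sorted positions whose mismatch fraction exceeds `threshold`."""
    fractions = mismatch_fraction_by_column(reads, template)
    return sorted(pos for pos, frac in fractions.items() if frac > threshold)


# Hypothetical data: a reference and a consensus with "CC" (positions 9-10) deleted.
reference = "GATTACAGGCCTTAAGCTGA"
consensus = reference[:9] + reference[11:]
reads = [("TACAGGTTA", 3), ("CAGGTTAAG", 5)]

before = high_entropy_columns(reads, reference)
after = high_entropy_columns(reads, consensus)
print(before, after)  # [9, 10, 11, 12, 13] []

# Accept only if noisy columns are removed, not merely moved elsewhere.
reduces_entropy = len(after) < len(before)
```

A realignment that only shifted the mismatch column to a neighbouring position would leave the count unchanged and would be rejected under this rule.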

### --knownAlleles / -known

Input VCF file(s) with known indels
Any number of VCF files representing known indels to be used for constructing alternate consensuses. These could be, for example, dbSNP and/or official 1000 Genomes indel calls. Non-indel variants in these files will be ignored.

--knownAlleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

List[RodBinding[VariantContext]]  []

### --LODThresholdForCleaning / -LOD

LOD threshold above which the cleaner will clean
This term is equivalent to "significance" - i.e. is the improvement significant enough to merit realignment? Note that this number should be adjusted based on your particular data set. For low coverage and/or when looking for indels with low allele frequency, this number should be smaller.

double  5.0  [ [ -∞  ∞ ] ]
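The effect of coverage on this threshold can be sketched with a simplified score. The scoring rule below (the drop in summed base quality at mismatching positions, divided by 10 to convert Phred units to log10 units) is an assumption made for illustration and is not stated on this page. The function names are hypothetical.

```python
def realignment_score(mismatch_quals_before, mismatch_quals_after):
    """Drop in summed mismatch base quality, scaled from Phred to log10 units."""
    return (sum(mismatch_quals_before) - sum(mismatch_quals_after)) / 10.0


def should_realign(mismatch_quals_before, mismatch_quals_after, lod_threshold=5.0):
    """Realign only if the improvement reaches the LOD threshold."""
    score = realignment_score(mismatch_quals_before, mismatch_quals_after)
    return score >= lod_threshold


deep = [30] * 8   # eight Q30 mismatches removed by the realignment
shallow = [30]    # a single Q30 mismatch removed (low coverage)

print(should_realign(deep, []))                        # True  (score 24.0)
print(should_realign(shallow, []))                     # False (score 3.0)
print(should_realign(shallow, [], lod_threshold=2.0))  # True
```

With few supporting reads the achievable improvement is small, so a lower threshold is needed before the realignment is accepted.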

### --maxConsensuses / -maxConsensuses

Max alternate consensuses to try (necessary to improve performance in deep coverage)
For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.

int  30  [ [ -∞  ∞ ] ]

### --maxIsizeForMovement / -maxIsize

Maximum insert size of read pairs that we attempt to realign
For expert users only!

int  3000  [ [ -∞  ∞ ] ]

### --maxPositionalMoveAllowed / -maxPosMove

Maximum positional move in basepairs that a read can be adjusted during realignment
For expert users only!

int  200  [ [ -∞  ∞ ] ]

Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)
For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.

int  120  [ [ -∞  ∞ ] ]

Max reads allowed at an interval for realignment
For expert users only! If this value is exceeded at a given interval, realignment is not attempted and the reads are passed to the output file(s) as-is. If you need to allow more reads (e.g. with very deep coverage) regardless of memory, use a higher number.

int  20000  [ [ -∞  ∞ ] ]

Max reads allowed to be kept in memory at a time by the SAMFileWriter
For expert users only! To minimize memory consumption you can lower this number (but then the tool may skip realignment on regions with too much coverage; and if the number is too low, it may generate errors during realignment). Just make sure to give Java enough memory! 4 GB should be enough with the default value.

int  150000  [ [ -∞  ∞ ] ]

### --noOriginalAlignmentTags / -noTags

Don't output the original cigar or alignment start tags for each realigned read in the output bam

boolean  false

### --nWayOut / -nWayOut

Generate one output file for each input (-I) bam file (not compatible with -output)
Reads from all input files will be realigned together, but then each read will be saved in the output file corresponding to the input file that the read came from. There are two ways to generate output bam file names: 1) if the value of this argument is a general string (e.g. '.cleaned.bam'), then extensions (".bam" or ".sam") will be stripped from the input file names and the provided string value will be pasted on instead; 2) if the value ends with '.map' (e.g. input_output.map), then a two-column tab-separated file with the specified name must exist and list a unique output file name (2nd column) for each input file name (1st column). Note that some GATK arguments do NOT work in conjunction with nWayOut (e.g. --disable_bam_indexing).

String
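The two naming modes described above can be sketched as follows. This is an illustration of the stated rules only: the function names are hypothetical, and how the tool treats directories in the input paths is not specified here, so the sketch keeps each path as given.

```python
def names_from_suffix(input_files, suffix):
    """Mode 1: strip a trailing .bam/.sam and append the given suffix."""
    names = {}
    for path in input_files:
        stem = path
        for ext in (".bam", ".sam"):
            if path.endswith(ext):
                stem = path[:-len(ext)]
                break
        names[path] = stem + suffix
    return names


def names_from_map(input_files, map_lines):
    """Mode 2: look up each input in a two-column, tab-separated mapping."""
    mapping = {}
    for line in map_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        source, target = line.split("\t")
        mapping[source] = target
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("output file names in the map must be unique")
    return {path: mapping[path] for path in input_files}


def nway_output_names(input_files, nway_out):
    """Pick the mode from the argument value, as described for --nWayOut."""
    if nway_out.endswith(".map"):
        with open(nway_out) as handle:
            return names_from_map(input_files, handle)
    return names_from_suffix(input_files, nway_out)


print(nway_output_names(["sampleA.bam", "sampleB.sam"], ".cleaned.bam"))
# {'sampleA.bam': 'sampleA.cleaned.bam', 'sampleB.sam': 'sampleB.cleaned.bam'}
```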

### --out / -o

Output bam
The realigned bam file.

GATKSAMFileWriter

### --targetIntervals / -targetIntervals

Intervals file output from RealignerTargetCreator
The interval list output from the RealignerTargetCreator tool using the same bam(s), reference, and known indel file(s).

IntervalBinding[Feature]  (required)

GATK version 3.2-2-gec30cee built at 2014/09/12 22:29:29. GTD: NA