Performs local realignment of reads to correct misalignments due to the presence of indels.
The local realignment tool is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percentage of regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken for SNPs. Moreover, since read mapping algorithms operate on each read independently, they cannot place reads on the reference genome such that mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads that cover the indel only near their start or end are often incorrectly mapped with respect to the true indel and also require realignment. Local realignment serves to transform regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches. Unlike most mappers, this walker uses the full alignment context to determine whether an appropriate alternate reference (i.e. indel) exists. Following local realignment, the GATK tool Unified Genotyper can be used to sensitively and specifically identify indels.
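To make the mismatch argument concrete, here is a toy Python sketch. It is not GATK code: the sequences and helper names are invented for illustration. A read that spans a 3 bp deletion near its end shows three mismatches when placed without a gap, and none once the deletion is introduced.

```python
def count_mismatches(read, ref_segment):
    """Count positions where the read disagrees with the reference segment."""
    return sum(r != g for r, g in zip(read, ref_segment))


def ungapped_mismatches(ref, read, start):
    """Mismatches when the read is placed at `start` with no gap."""
    return count_mismatches(read, ref[start:start + len(read)])


def deletion_mismatches(ref, read, start, del_start, del_len):
    """Mismatches when the read skips ref[del_start:del_start + del_len]."""
    left_len = del_start - start
    right_start = del_start + del_len
    right_len = len(read) - left_len
    spliced = ref[start:del_start] + ref[right_start:right_start + right_len]
    return count_mismatches(read, spliced)


# Hypothetical reference; the sample genome lacks the "TTT" at positions 9-11.
ref = "AAACCCGGGTTTACGTACGT"
# Read sequenced from the sample, starting at reference position 2.
read = "ACCCGGGACG"

ungapped = ungapped_mismatches(ref, read, start=2)
gapped = deletion_mismatches(ref, read, start=2, del_start=9, del_len=3)
print(ungapped, gapped)  # 3 0
```

The three trailing mismatches in the ungapped placement are exactly the kind of artifact that can be mistaken for SNPs.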
For more details, see http://www.broadinstitute.org/gatk/guide/article?id=38
One or more aligned BAM files and optionally one or more lists of known indels.
A realigned version of your input BAM file(s).
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T IndelRealigner \
  -R ref.fasta \
  -I input.bam \
  -targetIntervals intervalListFromRTC.intervals \
  -o realignedBam.bam \
  [-known /path/to/indels.vcf] \
  [-compress 0]
(The -compress 0 argument is recommended to speed up the process *if* this is only a temporary file; otherwise, use the default value.)
This Read Filter is automatically applied to the data by the Engine before processing by IndelRealigner.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For details, see the list further down below the table.
|Name||Type||Default value||Summary|
|--targetIntervals||IntervalBinding[Feature]||NA||Intervals file output from RealignerTargetCreator|
|--consensusDeterminationModel||ConsensusDeterminationModel||USE_READS||Determines how to compute the possible alternate consensuses|
|--knownAlleles||List[RodBinding[VariantContext]]||||Input VCF file(s) with known indels|
|--LODThresholdForCleaning||double||5.0||LOD threshold above which the cleaner will clean|
|--nWayOut||String||NA||Generate one output file for each input (-I) bam file (not compatible with -output)|
|--entropyThreshold||double||0.15||Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0)|
|--maxConsensuses||int||30||Max alternate consensuses to try (necessary to improve performance in deep coverage)|
|--maxIsizeForMovement||int||3000||Maximum insert size of read pairs that we attempt to realign|
|--maxPositionalMoveAllowed||int||200||Maximum positional move in basepairs that a read can be adjusted during realignment|
|--maxReadsForConsensuses||int||120||Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)|
|--maxReadsForRealignment||int||20000||Max reads allowed at an interval for realignment|
|--maxReadsInMemory||int||150000||Max reads allowed to be kept in memory at a time by the SAMFileWriter|
|--noOriginalAlignmentTags||boolean||false||Don't output the original cigar or alignment start tags for each realigned read in the output bam|
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Determines how to compute the possible alternate consensuses. We recommend that users run with USE_READS when trying to realign high quality longer read data mapped with a gapped aligner; Smith-Waterman is really only necessary when using an ungapped aligner (e.g. MAQ in the case of single-end read data).
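The --consensusDeterminationModel argument is an enumerated type (ConsensusDeterminationModel), which can have one of the following values:
KNOWNS_ONLY: uses only indels from the provided known indel file(s).
USE_READS: additionally uses indels already present in the original alignments of the reads.
USE_SW: additionally uses Smith-Waterman to generate alternate consensuses.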
Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0). For expert users only! This is similar to the argument in the RealignerTargetCreator walker. The point here is that the realigner will only proceed with the realignment (even above the given threshold) if it minimizes entropy among the reads (and doesn't simply push the mismatch column to another position). This parameter is just a heuristic and should be adjusted based on your particular data set.
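As a rough illustration of this heuristic, the following Python sketch computes the fraction of mismatching bases in each pileup column and flags the columns above the threshold. It is a simplification under stated assumptions: the function names and toy pileups are hypothetical, the counts are unweighted, and the tool's internal calculation may differ.

```python
def mismatch_fraction(column_bases, ref_base):
    """Fraction of bases in one pileup column that differ from the reference."""
    if not column_bases:
        return 0.0
    return sum(b != ref_base for b in column_bases) / len(column_bases)


def high_entropy_columns(pileup, ref, threshold=0.15):
    """Indices of columns whose mismatch fraction exceeds the threshold."""
    return [
        i
        for i, (column, ref_base) in enumerate(zip(pileup, ref))
        if mismatch_fraction(column, ref_base) > threshold
    ]


# Toy data: one string per reference position, one character per read.
ref = "ACGT"
before = ["AAAA", "CCTT", "GGGG", "TTTT"]   # mismatches pile up at column 1
shifted = ["AAAA", "CCCC", "GGAA", "TTTT"]  # mismatches merely moved to column 2
cleaned = ["AAAA", "CCCC", "GGGG", "TTTT"]  # mismatches resolved

print(high_entropy_columns(before, ref))   # [1]
print(high_entropy_columns(shifted, ref))  # [2]
print(high_entropy_columns(cleaned, ref))  # []
```

Under this simplified view, only the `cleaned` outcome reduces the number of high-entropy columns; the `shifted` outcome just relocates the problem, which is the case the realigner is meant to reject.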
Input VCF file(s) with known indels. Any number of VCF files representing known indels to be used for constructing alternate consensuses. These could be, for example, dbSNP and/or official 1000 Genomes indel calls. Non-indel variants in these files will be ignored. --knownAlleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
LOD threshold above which the cleaner will clean. This term is equivalent to "significance" - i.e. is the improvement significant enough to merit realignment? Note that this number should be adjusted based on your particular data set. For low coverage and/or when looking for indels with low allele frequency, this number should be smaller.
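One way to picture the threshold: treat the sum of base qualities at mismatching positions as a Phred-scaled penalty, and accept a realignment only if that penalty drops by more than the threshold on a log10 scale. The Python sketch below illustrates that idea with hypothetical names and toy numbers; it is an assumption about the general shape of the test, not the tool's actual scoring code.

```python
def mismatch_quality_sum(reads):
    """Sum base qualities at mismatching positions.

    Each read is a list of (is_mismatch, base_quality) pairs.
    """
    return sum(q for read in reads for is_mismatch, q in read if is_mismatch)


def should_realign(original_reads, realigned_reads, lod_threshold=5.0):
    """Accept the realignment if the penalty drops by more than the threshold."""
    # Phred qualities are -10 * log10(p), so dividing by 10 gives log10 units.
    improvement = (
        mismatch_quality_sum(original_reads)
        - mismatch_quality_sum(realigned_reads)
    ) / 10.0
    return improvement > lod_threshold


# Deep coverage: three reads, each with three Q30 mismatches that realignment fixes.
deep_before = [[(True, 30)] * 3 for _ in range(3)]
deep_after = [[(False, 30)] * 3 for _ in range(3)]

# Low coverage: a single read with one Q30 mismatch that realignment fixes.
low_before = [[(True, 30), (False, 30)]]
low_after = [[(False, 30), (False, 30)]]

print(should_realign(deep_before, deep_after))                   # True
print(should_realign(low_before, low_after))                     # False
print(should_realign(low_before, low_after, lod_threshold=2.0))  # True
```

The last two calls show why a smaller threshold can be appropriate for low coverage: the same correction yields less total evidence when fewer reads support it.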
Max alternate consensuses to try (necessary to improve performance in deep coverage). For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.
Maximum insert size of read pairs that we attempt to realign. For expert users only!
Maximum positional move in basepairs that a read can be adjusted during realignment. For expert users only!
Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage). For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.
Max reads allowed at an interval for realignment. For expert users only! If this value is exceeded at a given interval, realignment is not attempted and the reads are passed to the output file(s) as-is. If you need to allow more reads (e.g. with very deep coverage) regardless of memory, use a higher number.
Max reads allowed to be kept in memory at a time by the SAMFileWriter. For expert users only! To minimize memory consumption you can lower this number (but then the tool may skip realignment on regions with too much coverage; and if the number is too low, it may generate errors during realignment). Just make sure to give Java enough memory! 4 GB should be enough with the default value.
Don't output the original cigar or alignment start tags for each realigned read in the output bam.
Generate one output file for each input (-I) bam file (not compatible with -output). Reads from all input files will be realigned together, but then each read will be saved in the output file corresponding to the input file that the read came from. There are two ways to generate output bam file names: 1) if the value of this argument is a general string (e.g. '.cleaned.bam'), then extensions (".bam" or ".sam") will be stripped from the input file names and the provided string value will be pasted on instead; 2) if the value ends with '.map' (e.g. input_output.map), then a two-column tab-separated file with the specified name must exist and list a unique output file name (2nd column) for each input file name (1st column). Note that some GATK arguments do NOT work in conjunction with nWayOut (e.g. --disable_bam_indexing).
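The two naming modes can be sketched in Python as follows. This is an illustration of the rules as described above, with hypothetical function names; it works on file names only and does not reproduce how the tool handles directories or error reporting.

```python
def outputs_from_suffix(input_names, suffix):
    """Mode 1: strip a trailing .bam or .sam and append the given string."""
    outputs = {}
    for name in input_names:
        base = name
        for ext in (".bam", ".sam"):
            if base.endswith(ext):
                base = base[: -len(ext)]
                break
        outputs[name] = base + suffix
    return outputs


def outputs_from_map(map_path):
    """Mode 2: read a two-column, tab-separated input -> output map file."""
    outputs = {}
    seen_targets = set()
    with open(map_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            source, target = line.split("\t")
            if target in seen_targets:
                raise ValueError(f"output name listed twice: {target}")
            seen_targets.add(target)
            outputs[source] = target
    return outputs


def resolve_outputs(input_names, n_way_out):
    """Pick the naming mode from the value given to the nWayOut argument."""
    if n_way_out.endswith(".map"):
        mapping = outputs_from_map(n_way_out)
        return {name: mapping[name] for name in input_names}
    return outputs_from_suffix(input_names, n_way_out)


print(resolve_outputs(["sample1.bam", "sample2.sam"], ".cleaned.bam"))
```

With the suffix '.cleaned.bam', `sample1.bam` maps to `sample1.cleaned.bam` and `sample2.sam` maps to `sample2.cleaned.bam`.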
Output bam. The realigned bam file.
Intervals file output from RealignerTargetCreator. The interval list output from the RealignerTargetCreator tool using the same bam(s), reference, and known indel file(s).
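Because the interval list must come from the same bam(s), reference, and known indel file(s), it can help to build both command lines from one set of inputs. The Python sketch below does that. The helper names are hypothetical, the file names are the placeholders from the usage example, and the RealignerTargetCreator flags are assumed to mirror the IndelRealigner ones shown there (-T, -R, -I, -known, -o); the 4g heap size is likewise carried over as an assumption.

```python
import subprocess


def shared_args(ref, bams, known):
    """Arguments that must be identical in both steps."""
    args = ["-R", ref]
    for bam in bams:
        args += ["-I", bam]
    for vcf in known:
        args += ["-known", vcf]
    return args


def target_creator_cmd(ref, bams, known, intervals_out, jar="GenomeAnalysisTK.jar"):
    """Step 1: create the interval list."""
    return (
        ["java", "-Xmx4g", "-jar", jar, "-T", "RealignerTargetCreator"]
        + shared_args(ref, bams, known)
        + ["-o", intervals_out]
    )


def indel_realigner_cmd(ref, bams, known, intervals, out_bam, jar="GenomeAnalysisTK.jar"):
    """Step 2: realign using the interval list from step 1."""
    return (
        ["java", "-Xmx4g", "-jar", jar, "-T", "IndelRealigner"]
        + shared_args(ref, bams, known)
        + ["-targetIntervals", intervals, "-o", out_bam]
    )


def run_all(commands):
    """Run each command in order; requires Java and the GATK jar."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


ref, bams, known = "ref.fasta", ["input.bam"], ["/path/to/indels.vcf"]
intervals = "intervalListFromRTC.intervals"
step1 = target_creator_cmd(ref, bams, known, intervals)
step2 = indel_realigner_cmd(ref, bams, known, intervals, "realignedBam.bam")
print(" ".join(step1))
print(" ".join(step2))
```

Calling `run_all([step1, step2])` would execute the two steps in order; it is left uncalled here so the sketch runs without GATK installed.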
GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.