Created 2012-07-23 23:55:56 | Updated 2012-07-23 23:55:56

A new tool has been released!

Check out the documentation at IndelRealigner.

Created 2012-07-23 16:48:55 | Updated 2012-09-30 23:35:55

## Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.

## Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.

# Running the Indel Realigner only at known sites

While we advocate running the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will also work on a single lane of sequencing data when run in known-sites-only mode (--consensusDeterminationModel KNOWNS_ONLY). Novel sites won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained. The command lines below show the arguments needed for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R /path/to/reference.fasta \
-o /path/to/output.intervals \
-known /path/to/indel_calls.vcf


The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
-jar /path/to/GenomeAnalysisTK.jar \
-I <lane-level.bam> \
-R <ref.fasta> \
-T IndelRealigner \
-targetIntervals <intervalListFromStep1Above.intervals> \
-o <realignedBam.bam> \
-known /path/to/indel_calls.vcf \
--consensusDeterminationModel KNOWNS_ONLY \
-LOD 0.4
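
The two steps above can be scripted when there are many lane-level bams. The following is a minimal Python sketch, not an official wrapper: it only assembles the argument lists shown above. The function names, the `.realigned.bam` output naming, and the lane file names are assumptions made for illustration; the GATK flags are the ones used in the commands above.

```python
from typing import List

# Placeholder paths copied from the example commands above; adjust to your site.
GATK_JAR = "/path/to/GenomeAnalysisTK.jar"
TMP_DIR = "/path/to/tmpdir"


def target_creator_cmd(reference: str, known_vcf: str, intervals_out: str) -> List[str]:
    """RealignerTargetCreator command; run once per set of known indels."""
    return [
        "java", "-Xmx1g", "-jar", GATK_JAR,
        "-T", "RealignerTargetCreator",
        "-R", reference,
        "-o", intervals_out,
        "-known", known_vcf,
    ]


def knowns_only_realign_cmd(bam: str, reference: str, known_vcf: str,
                            intervals: str) -> List[str]:
    """IndelRealigner command in KNOWNS_ONLY mode for one lane-level bam."""
    # Hypothetical naming scheme: lane1.bam -> lane1.realigned.bam
    stem = bam[:-len(".bam")] if bam.endswith(".bam") else bam
    return [
        "java", "-Xmx4g", "-Djava.io.tmpdir=" + TMP_DIR,
        "-jar", GATK_JAR,
        "-I", bam,
        "-R", reference,
        "-T", "IndelRealigner",
        "-targetIntervals", intervals,
        "-o", stem + ".realigned.bam",
        "-known", known_vcf,
        "--consensusDeterminationModel", "KNOWNS_ONLY",
        "-LOD", "0.4",
    ]


if __name__ == "__main__":
    ref, vcf, ivals = "ref.fasta", "indel_calls.vcf", "output.intervals"
    # Step 1: once per set of known indels.
    print(" ".join(target_creator_cmd(ref, vcf, ivals)))
    # Step 2: once per lane-level bam (hypothetical file names).
    for lane_bam in ["lane1.bam", "lane2.bam"]:
        print(" ".join(knowns_only_realign_cmd(lane_bam, ref, vcf, ivals)))
```

The lists can be passed to `subprocess.run` as-is; printing them first makes it easy to review the exact commands before launching anything.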


Created 2013-08-21 21:15:21 | Updated 2014-02-08 20:09:15

GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history


## Reduce Reads

• Changed the underlying convention of having unstranded reduced reads; instead there are now at least 2 compressed reads at every position, one for each strand (forward and reverse). This allows us to maintain strand information that is useful for downstream filtering.
• Fixed bug where representative depths were arbitrarily being capped at 127 (instead of the expected 255).
• Fixed bug where insertions downstream of a variant region weren't triggering a stop to the compression.
• Fixed bug when using --cancer_mode where alignments were being emitted out of order (and causing the tool to fail).

## Unified Genotyper

• Added --onlyEmitSamples argument that, when provided, instructs the caller to emit only the selected samples into the VCF (even though the calling is performed over all samples present in the provided bam files).
• FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
• Added a (very) experimental argument (allSitePLs) that will have the caller emit PLs for all sites (including reference sites). Note that this does not give a fully accurate reference model because it models only SNPs. For proper handling of the reference model, please use the Haplotype Caller.

## Haplotype Caller

• Added a still somewhat experimental PCR indel error model to the Haplotype Caller. By default this modeling is turned on and is very useful for removing false positive indel calls associated with PCR slippage around short tandem repeats (esp. homopolymers). Users have the option (with the --pcr_indel_model argument) of turning it off or making it even more aggressive (at the expense of losing some true positives too).
• Added the ability to emit accurate likelihoods for non-variant positions (i.e. what we call a "reference model" that incorporates indels as well as SNP confidences at every position). The output format can be either a record for every position or use the gVCF style recording of blocks. See the --emitRefConfidence argument for more details; note that this replaces the use of "--output_mode EMIT_ALL_SITES" in the HaplotypeCaller.
• Improvements to the internal likelihoods that are generated by the Haplotype Caller. Specifically, this tool now uses a tri-state correction like the Unified Genotyper, corrects for overlapping read pairs (from the same underlying fragment), and does not run contamination removal (allele-biased downsampling) by default.
• Several small runtime performance improvements were added (although we are still hard at work on larger improvements that will allow calling to scale to many samples; we're just not there yet).
• Fixed bug in how adapter clipping was performed (we now clip only after reverting soft-clipped bases).
• FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
• Improved the "dangling tail" recovery in the assembly algorithm, which allows for higher sensitivity in calling variants at the edges of coverage (e.g. near the ends of targets in an exome).
• Added the ability to run allele-biased downsampling with different per-sample values like the Unified Genotyper (contributed by Yossi Farjoun).

## Variant Annotator

• Fixed bug where only the last -comp was being annotated at a site.

## Indel Realigner

• Fixed bug that arises because of secondary alignments and that was causing the tool not to update the alignment start of the mate when a read was realigned.

## Phase By Transmission

• Fixed bug where multi-allelic records were being completely dropped by this tool. Now they are emitted unphased.

## Variant Recalibrator

• General improvements to the Gaussian modeling, mostly centered around separating the parameters for the positive and negative training models.
• Added mode to not emit (at all) variant records that are filtered out.
• This tool now automatically orders the annotation dimensions by their standard deviation instead of the order they were specified on the command-line in order to stabilize the training and have it produce optimal results.
• Fixed bug where the tool occasionally produced bad log10 values internally.

## Miscellaneous

• General performance improvements to the VCF reading code contributed by Michael McCowan.
• Error messages are much less verbose and "scary."
• Fixed the ReadBackedPileup class to represent mapping qualities as ints, not (signed) bytes.
• Added the engine-wide ability to do on-the-fly BAM file sample renaming at runtime (see the documentation for the --sample_rename_mapping_file argument for more details).
• Fixed bug in how the GATK counts filtered reads in the traversal output.
• Added a new tool called Qualify Intervals.
• Fixed major bug in the BCF encoding (the previous version was producing problematic files that were failing when trying to be read back into the GATK).
• Picard/sam/tribble/variant jars updated to version 1.96.1534.
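
Two items in the notes above (the depth cap at 127 and the ReadBackedPileup mapping-quality fix) concern 8-bit storage. As background only, and not as a description of the GATK code itself: a signed 8-bit value tops out at 127, whereas an unsigned one reaches 255, and values above 127 wrap to negative numbers when read as signed bytes. The small Python sketch below illustrates that wrap-around; the helper name is hypothetical.

```python
def as_signed_byte(value: int) -> int:
    """Reinterpret the low 8 bits of value as a two's-complement signed byte."""
    return int.from_bytes(bytes([value & 0xFF]), byteorder="big", signed=True)


if __name__ == "__main__":
    # Values up to 127 survive unchanged; larger ones wrap to negatives.
    for v in (60, 127, 128, 200, 255):
        print(v, "->", as_signed_byte(v))
```

Storing such quantities as ints, as the fix above describes for mapping qualities, avoids the wrap-around entirely.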

Created 2015-05-30 13:45:41 | Updated

Hello,

I'm quite new to SNP calling. I am trying to set up a pipeline which includes GATK IndelRealigner as a final step. My bam file (before realignment) is a little over 1GB. After running the indel realigner, however, it's reduced to 18MB! I'm assuming it's throwing out way too many reads or something has gone wrong.

I'm calling the indel realigner with the default options as follows:

java -Xmx16g -jar GATK_DIR/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R /path/to/my/ref \
-I input.bam.intervals \
-targetIntervals input.bam.intervals \
-o realn.bam

I am generating the read groups using AddOrReplaceReadGroups.jar (from Picard tools) and the interval file using GATK RealignerTargetCreator with default options.

My bam file was generated from the raw reads of experiment SRA181417 fetched from SRA (after cleaning adapters using cutadapt, mapping to the reference using bwa-mem, and removing duplicate reads using Picard tools). I have tried this on other reads and do not have the same issue.

Can anyone comment on why the indel realigner could be throwing out so many reads?

Thank you

Created 2015-05-13 12:41:10 | Updated | Tags: indelrealigner bam error

Hello,

I ran into a problem after the pre-processing: it seems that extra contigs were added to my bam file compared to the reference I used, which makes the indel realigner step impossible to run. I have checked the headers of my file and the reference is the same, but my bam file has hundreds of additional contigs. Not sure what happened.

The steps to get the bam were:

• Aligned with bwa mem
• Transform to bam and sort (Samtools)
• Dedup (picard)
• Add read group (picard)
• Index bam (samtools)
• Run Realigner target creator

When I check the header of my bam file it still shows the right contigs, but when running, the tool complains of differences (additional contigs) compared to my reference.

I am currently re-testing the whole pipeline on a single sample, but do you have any pointers to what could cause this? Maybe a problem with the bam formatting?

I am running GATK 3.3.0-g37228af with Java 1.7. I have attached the output log from the command.

Thanks, Julien

PS: I attended your workshop in Cambridge!

Created 2015-05-05 09:47:41 | Updated | Tags: indelrealigner realignertargetcreator

I was just wondering what you guys thought of my realignment intervals length distribution.

This is 30Mb from a single diploid sample without prior indel position information. Approximately 60,000 events, i.e. one every fifty bases, seems like a lot. How indicative of true indels is the data from TargetCreator and IndelRealigner? Guess I'll have to check with the ug-vcf calls... Across the genome, the distribution of 'all' events is uniform. Does multi-sample realignment improve the accuracy or efficiency of the realignment process?

Created 2015-04-27 06:58:38 | Updated | Tags: indelrealigner bqsr

Hi,

I have gone through the realignment step and found that the realigner changes the CIGAR of alignments in the bam file. Most structural variant detection tools depend on the CIGAR field. So my question is: is it right to use the recalibrated bam file, and does it have any advantage for SV detection over the raw sorted bam file?

Created 2015-03-29 01:40:41 | Updated | Tags: indelrealigner fixmisencodedquals exome-seq fix-misencoded-quality-scores

Dear GATK team,

I have two input fasta files from exome-seq. One is coded with Q64 and the other is coded with Q33 quality scores. I want to combine the two input fasta files and run bwa+GATK. How do I combine them for IndelRealigner? I suppose that IndelRealigner needs all reads from both Q64 and Q33. Can I run IndelRealigner on them separately and then join them? Will this cause problems? I have searched through many posts but can't find an answer. Please help me.

Thanks, Woody

Created 2015-03-26 14:47:13 | Updated | Tags: indelrealigner

I have been trying to run IndelRealigner with the following command (tumorPfx etc. are file names)

java -d64 -jar $gatkJar -R $hgReference -T IndelRealigner -rf BadCigar -I $tumorPfx.bam -known $G1000_Mills -known $G1000_Phase1_Indels -targetIntervals $tumorSample.intervals -o tumorPfx.realn.bam

and have been getting the following output and error:

##### ERROR stack trace
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)

On this website, I found a similar error from somebody trying to run HaplotypeCaller, albeit with Index and Size = 3, where it was remarked that it may be due to a bug or a Java version issue. Is the same thing going on here?

Thank you, Max

Created 2015-03-15 21:46:07 | Updated | Tags: indelrealigner malformedbam

Hi Team,

I get an error with GATK in the variant calling steps, using the BAM file from the realignment step. The error indicated something wrong with the bai file, so I tried to create it anew. But then this comes up, saying there is something wrong with the bam (see below). This bam was created with IndelRealigner (no errors).

Thanks! Alexander

picard 1 BuildBamIndex INPUT=B57.3.bam
[Sun Mar 15 22:37:46 CET 2015] picard.sam.BuildBamIndex INPUT=B57.3.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Mar 15 22:37:46 CET 2015] Executing as kaktus42@soroban on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
[Sun Mar 15 22:41:19 CET 2015] picard.sam.BuildBamIndex done. Elapsed time: 3,55 minutes.
Runtime.totalMemory()=855638016
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.FileTruncatedException: Premature end of file
    at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:382)
    at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:127)
    at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:252)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:404)
    at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380)
    at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:366)
    at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:199)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:660)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:634)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:527)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:501)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:287)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:271)
    at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:138)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Created 2015-02-16 08:45:57 | Updated

If I do not have information about known variants, is it fine to skip the BQSR step after the indel realignment step?

Created 2015-02-12 07:12:06 | Updated

Hello, do you recommend realigning around indels and recalibrating quality scores before running Mutect? Thanks!

Created 2015-01-26 20:35:50 | Updated

Just asking for confirmation: if I run IndelRealigner with the option -L my.capture.bed, does IndelRealigner:

• keep all the reads but only realign them in region of the bed ?
• or only keep the reads in the given region (smaller BAM)?

Thanks

Created 2015-01-14 00:52:00 | Updated 2015-01-14 00:56:38

Hi GATK team,

Though I have read the seminar slides on GATK indel realignment, I still have no idea how GATK does it. Can anyone suggest something, say a reference or a brief introduction?

Actually, what I care about most is whether indel realignment takes base quality into consideration. Thank you very much.

bless~

Created 2014-12-26 21:09:34 | Updated

Hi GATK team, my jobs are currently running and I'm a little bit lazy to try this later: I saw that the .interval files produced by RealignerTargetCreator can be quite large. Can I use a ".interval.gz" extension on the command line of RealignerTargetCreator ? Can I use this *.gz file with IndelRealigner ?

Created 2014-12-08 15:37:48 | Updated

Hi. I am using IndelRealigner for local indel realignment. The bam used as input is 6.6GB, while the realigned bam is 22GB.

Did I miss anything there?

The pipeline I used is as below:

echo "Patient ${sample}: @create intervals for local realignment"
sudo java -Djava.io.tmpdir=${out_dir}/tmpdir \
-Xmx${maxMem} -Xms${minMem} \
-jar ${gatk} \
-T RealignerTargetCreator \
-I ${out_dir}/${input_next} \
-o ${out_dir}/${input_next}.forRealigner.intervals \
-R ${reference} \
-L ${intervals} \
--interval_padding 200 \
-rf ${reads_filter} \
-known ${kg_mills} \
-known ${kg_indels} \
-nt ${maxDataThread} \
--allow_potentially_misencoded_quality_scores \
2> ${out_dir}/logs/${sample_prefix}_createIntervals.err

echo "Patient ${sample}: @local realignment"
sudo java -Djava.io.tmpdir=${out_dir}/tmpdir \
-Xmx${maxMem} -Xms${minMem} \
-jar $gatk \
-T IndelRealigner \
-I ${out_dir}/${input_next} \
-o ${out_dir}/${sample_prefix}.dedup.realigned.bam \
-R ${reference} \
-targetIntervals ${out_dir}/${input_next}.forRealigner.intervals \
-rf ${reads_filter} \
-known ${kg_mills} \
-known ${kg_indels} \
-compress 0 \
-LOD 0.4 \
--allow_potentially_misencoded_quality_scores \
2> ${out_dir}/logs/${sample_prefix}_realignment.err


Thanks.

Created 2014-11-18 15:48:00 | Updated

Hello, I want to use GATK IndelRealigner in parallel mode. I'm looking for a QScript, because the -nt and -nct flags will not work. I use the current Queue version, 3.3-0.

Joern

Created 2014-11-11 10:24:07 | Updated 2014-11-11 10:44:58

Hi,

Recently I experienced a slightly annoying problem with IndelRealigner losing some reads. It is usually just a few reads missing from the output, but when I compare the output and input and extract the reads that are missing after the IndelRealigner job, I cannot see what is wrong with them. An example of one such read is below:

M01823:187:000000000-AB050:1:1109:16397:19623 69 8 64405501 0 * = 64405501 0 TTTGCTTTCAAAAATACCTGTGCAGGTGGAGGTGTGCGTCTGCGTCTAACGGTGTGCGGTGCGAATTTCGACGATCGTTGCATTAACTTGCGAAACCCCTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAAATAAAACAAACAAAACGAACTACTACAGACAACGACAAAAACCAAAAAACAACATATAAACAAATAAACGAGCAACACAACACAAATAAAAGAGCAAGCACTACAC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG885+3355<,8,,=,,,,,,3,,=4::?,,7,,7,*2+14<********/*/2***1/*0+++2++2/+++++*2*12*2*****2*;*/++2+++1:68***20++++* RG:Z:140919_M01823_0188_000000000-AB050_AACCCCTC-TGTTCTCT_L001 AS:i:0 XS:i:0

Its pair has been kept, but this read was removed.

It is a bit of a nuisance, as in our workflow we check the number of reads in the files after various steps for sanity, so a varying number of reads introduces problems. I would be grateful if you could advise why some reads get omitted by IndelRealigner so I could modify our workflow accordingly. Or could it be a bug?

Thank you, Dalia
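
For reference, the FLAG value 69 in the record above can be decoded with the bit definitions from the SAM specification. The Python sketch below does only that; the function name and the short labels are made up for illustration, and it makes no claim about why IndelRealigner dropped the read.

```python
# Bit values follow the FLAG table in the SAM format specification;
# the label strings are informal shorthand.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    print(decode_flag(69))
```

Since 69 = 64 + 4 + 1, the read is flagged as paired, unmapped, and first in pair, which is consistent with its CIGAR of `*` and mapping quality of 0.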

Created 2014-09-18 11:58:28 | Updated

Hello, I'm trying to realign approximately 115 bam files. I am able to do this with the -o command, but this results in an impressively large bam file that I cannot fix in Picard (FixMateInformation and SortSam). Unfortunately these are corrections that need to happen before the downstream GATK snp discovery. So I tried the -nWayOut command, to get an individual realigned bam file for each input, but this returns a stack trace ERROR that includes something about an unavailable reader id. I've pasted it below.

INFO 14:06:32,838 ProgressMeter - scaffold_0:4430818 1.17606954E8 59.5 m 30.0 s 0.4% 9.8 d 9.7 d
INFO 14:07:32,840 ProgressMeter - scaffold_0:4474066 1.18707144E8 60.5 m 30.0 s 0.4% 9.8 d 9.8 d
INFO 14:08:32,841 ProgressMeter - scaffold_0:4505563 1.1980727E8 61.5 m 30.0 s 0.4% 9.9 d 9.9 d
INFO 14:09:32,843 ProgressMeter - scaffold_0:4506325 1.20407434E8 62.5 m 31.0 s 0.4% 10.1 d 10.0 d
INFO 14:09:55,236 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR MESSAGE: Cannot enable index memory mapping for a SAM text reader

Created 2013-02-14 19:41:01 | Updated 2013-02-15 06:36:36

Hello, I am a first-time user of GATK and have spent some time now trying to get the input bam files into the appropriate format. To run IndelRealigner, I have added read groups, reordered, and indexed my bam file with the respective Picard tools.

My command-line is the following:

java -Djava.io.tmpdir='pwd'/tmp -jar GenomeAnalysisTK.jar -I ./add_read_groups_reorder_index.bam -R ./genome.fa -T IndelRealigner -targetIntervals ./gatk.intervals -o ./*.bam -known ./Mills-1000G-indels.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4


I get the following message:

SAM/BAM file /home/gp53/tophat2-merge-ctl-1st-2nd-readgroups-reorder-index.bam is malformed: SAM file doesn't have any read groups defined in the header.


My reads are paired-end and aligned with TopHat2. I would appreciate your help on this. Thanks, G.
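
One way to double-check the header before rerunning is to look for `@RG` lines in the text printed by `samtools view -H`. The Python sketch below parses such header text and lists the read-group IDs it finds; the function name and the example header are hypothetical, and it only inspects header lines, not the per-read RG tags.

```python
from typing import List


def read_group_ids(header_text: str) -> List[str]:
    """Collect the ID field of every @RG line in SAM header text."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


# Made-up header for illustration only.
example_header = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@RG\tID:lane1\tSM:sample1\tPL:ILLUMINA",
])

if __name__ == "__main__":
    print(read_group_ids(example_header))
```

An empty list for the file actually named in the error message would match the complaint that no read groups are defined in the header.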

Created 2013-02-14 12:37:20 | Updated

Hi,

When doing an indel realignment with GATK, the 'MD' field in the SAM/BAM record gets dropped for realigned reads. Is it possible to recompute them directly with GATK? I know that 'samtools calmd' does this, but are there alternative options?

Created 2013-02-04 17:31:24 | Updated

Does anyone know of any known issues with the IndelRealigner? The GATK is calling thousands of SNPs on one genome I have due to bad realignments. It appears the target finder identifies regions of the genome that are essentially perfectly aligned, but when the realigner gets to these areas it re-emits the reads as a new alignment that is a train wreck compared to what the realigner started with. I am going to investigate further but thought I would check if this rings any bells. I am using the flag -model USE_READS and have no known indels to work with.

Created 2012-12-20 01:21:50 | Updated

Hi, Some aligners produce Smith-Waterman alignments and may soft clip bases from a read when there are indels or mismatches near the ends of the reads. I was wondering if you include these bases in the realignment process? And if not whether you might consider making it an option? Thanks, Colin

Created 2012-11-28 11:47:29 | Updated

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired end reads, my protocol until now is (in brief, omitting options etc.)

bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads

A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 paired end = 171534452 sequences in fastQ file

BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records.
2 pairs never matched.
Marking 20272927 records as duplicates.
Found 2919670 optical duplicate clusters.

So nearly 6 million reads seem to be missing.

CreateTargets MicroScheduler reports

35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

So nearly 5 million reads seem to be missing.

The Realigner MicroScheduler reports

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which seems like a miracle to me, since 1) there are now even more reads than input sequences, and 2) all those crappy reads reported by CreateTargets do not appear.

From Base recalibration MicroScheduler, I get

41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.

I found these varying counts a little puzzling. Can someone please give me a hint on the logic behind these numbers? And does the protocol look meaningful?
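
As a quick sanity check on the figures quoted in this post, the arithmetic can be reproduced with a few lines of Python. This is only a sketch that re-derives the differences and percentages from the numbers above; the variable names are made up for illustration, and it says nothing about why the tools report different totals.

```python
# All counts are copied from the tool reports quoted in the post.
fastq_reads = 85767226 * 2          # paired-end reads in the FASTQ files
markdup_records = 165619516         # "Read ... records" from MarkDuplicates
target_creator_total = 166579875    # total seen by RealignerTargetCreator
realigner_total = 171551640         # total seen by IndelRealigner
bqsr_total = 171703265              # total seen by BaseRecalibrator

target_creator_filtered = {
    "BadMateFilter": 428072,
    "DuplicateReadFilter": 16077607,
    "MappingQualityZeroFilter": 19409876,
}
bqsr_filtered = {
    "DuplicateReadFilter": 16010068,
    "MappingQualityZeroFilter": 25387311,
}


def percent(part: int, total: int) -> float:
    """Percentage of total, rounded to two decimals like the GATK log."""
    return round(100.0 * part / total, 2)


if __name__ == "__main__":
    print("FASTQ reads:", fastq_reads)
    print("FASTQ minus MarkDuplicates:", fastq_reads - markdup_records)
    print("FASTQ minus TargetCreator:", fastq_reads - target_creator_total)
    print("Realigner minus FASTQ:", realigner_total - fastq_reads)
    print("BQSR minus FASTQ:", bqsr_total - fastq_reads)
    tc_sum = sum(target_creator_filtered.values())
    print("TargetCreator filtered:", tc_sum, percent(tc_sum, target_creator_total))
    bq_sum = sum(bqsr_filtered.values())
    print("BQSR filtered:", bq_sum, percent(bq_sum, bqsr_total))
```

The per-filter counts add up to the reported filtered totals in both logs, so the open question is the differing grand totals between tools rather than the filter breakdowns.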

Created 2012-11-27 15:38:23 | Updated

I can't seem to run the IndelRealigner on reads aligned to a reference whose scaffold names contain colons (":"). The RealignerTargetCreator step works correctly and generates the interval table, but the second step, IndelRealigner, fails. When I look at the generated interval table, I see the interval delimiter is a colon, which I imagine is the problem.

Unfortunately, I have a set of human references that have a colon in every scaffold name, so changing this would be a massive undertaking.

I believe this problem could be solved if you searched for the colon delimiter from the end of the interval string instead of from the beginning, so I'm hoping this is a really simple fix.

Thanks!
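
The suggestion above, splitting on the last colon rather than the first, can be sketched in Python as follows. This illustrates the idea only and is not GATK's actual interval parser; the function name and the example scaffold names are hypothetical.

```python
from typing import Tuple


def parse_interval(text: str) -> Tuple[str, int, int]:
    """Split 'contig:start-stop' (or 'contig:pos') on the LAST colon."""
    contig, sep, span = text.strip().rpartition(":")
    if not sep:
        raise ValueError("no ':' delimiter in %r" % text)
    start_text, dash, stop_text = span.partition("-")
    start = int(start_text)
    # A single position is treated as a one-base interval.
    stop = int(stop_text) if dash else start
    return contig, start, stop


if __name__ == "__main__":
    for example in ("chr1:100-200", "scaffold:12:100-250", "chr1:42"):
        print(example, "->", parse_interval(example))
```

One caveat of this approach: a bare contig name that itself ends in `:<number>` and carries no coordinates is ambiguous, so a real parser would also need to consult the list of known contig names.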

Created 2012-11-26 14:38:52 | Updated 2012-12-02 05:20:34

Hi. I am getting VERY odd results with some Streptococcus equi sequence. The BAM files from BWA align well in IGV, but when I run them through your pipeline there are many local errors where it seems that a single indel has somehow been incorrectly multiplied up. You need to see the IGV screenshot!

The bottom is a BAM file from BWA and the top is the final one from the GATK pipeline.

Created 2012-11-15 16:08:05 | Updated 2012-11-15 22:59:14

Hi, For both IndelRealigner/RealignerTargetCreator, there is an option for known indel sites as below:

-known /path/to/indels.vcf


However, from the bundle files collection such as from hg19, there are several vcf files:

1000G_indels_for_realignment.hg19.vcf
1000G_omni2.5.hg19.sites.vcf
1000G_omni2.5.hg19.vcf
dbsnp_132.hg19.excluding_sites_after_129.vcf
dbsnp_132.hg19.vcf
hapmap_3.3.hg19.sites.vcf
hapmap_3.3.hg19.vcf
indels_mills_devine.hg19.sites.vcf
indels_mills_devine.hg19.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf


Amongst them, just based on the names, 1000G_indels_for_realignment.hg19.vcf and indels_mills_devine.hg19.sites.vcf look like the files one is supposed to use for IndelRealigner/RealignerTargetCreator. Could you clarify the exact files for this purpose?

With the old version, I used 1000G_phase1.indels.hg19.vcf and Mills_and_1000G_gold_standard.indels.hg19.sites.vcf. I compared the new and old files, and they are quite different now.

Thanks

Mike

Created 2012-11-12 22:55:36 | Updated 2013-01-07 20:06:43

Hello,

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems that I initially ran the indel-realignment and quality-recalibration steps in reversed order. For example, in order of processing, I ran:

• MarkDuplicates
• Count Covariates
• Table Recalibration
• Realigner Target Creator
• Indel Realigner

What are the consequences for SNP and INDEL calling if the GATK steps are run as above? I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

Any help on this would be appreciated.

Mike

Created 2012-11-08 12:31:03 | Updated 2012-11-08 18:05:56

Hi

I am using the following set of commands on GATK2.1.13 to generate a VCF file

echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator  -o my.intervals -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key
echo "Realignment Done at date"
echo "Starting IndelRealigner at date"

echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam  -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key
echo "Realignment done at date"
echo "Starting UnifiedGenotyper at date"
echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper    -I myrealignedBam.bam    -o mygatk_vcf.vcf    --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key
echo "Gentoypxing complete at date"


When I do an 'mpileup' for B2_with_ReadGroup.ddup.sorted.bam, I get a decent 10 MB VCF file. But on the last step of the above pipeline, my "mygatk_vcf.vcf" is growing to 81 GB!

Do you know what is wrong ?

Created 2012-10-31 07:52:07 | Updated 2012-10-31 17:35:25

Hi, I've run into what appears to be a bug in handling output in IndelRealigner. When specifying --nWayOut everything works, but when I add --disable_bam_indexing, it appears to be expecting --out instead?

##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
....
##### ERROR MESSAGE: Value for argument with name '--out' (-o) is missing.


Created 2012-10-19 17:50:54 | Updated 2012-10-19 18:17:30

Hi everyone,

I'm using IndelRealigner to do a local realignment in the standard GATK workflow. I have used this pipeline before with success, but am now met with this error. I could not find any other examples of this, so I am posting as per the instructions in the error.

Cheers,

A.B.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.1-11-g13c0244):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------


~

Created 2012-09-25 19:30:15 | Updated 2012-09-25 19:32:13

I got this when I ran the IndelRealigner. The output bam is empty.

INFO  15:19:35,568 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours
INFO  15:19:36,910 GATKRunReport - Uploaded run statistics report to AWS S3


It didn't initialize. The sample is aligned to specific region of the genome and I did use -L option. For whole genome alignment of the same sample, I don't have any problems. Do you know why?

Created 2012-09-25 08:04:12 | Updated 2012-09-26 14:31:52

Hi,

Apologies if this has been reported but I can't find it in the forum.

We're in the process of upgrading to GATK v2 but have been using v1.5 and have just noticed a few cases where IndelRealigner suddenly ended without warning or any report of an error. See example below where it ended with only ~50% of the BAM file processed. I'm wondering if it's a memory issue if multiple samples were being run concurrently. But more importantly with no alert it makes it tricky for us to identify when this happens. Is this something that's been fixed in later versions e.g. GATK 2.1 i.e. will Indelrealigner report an error when it finishes but the sample has not been processed to completion?

INFO  16:21:03,120 TraversalEngine -      8:90782004        3.17e+07    3.3 h        6.3 m     47.8%         6.9 h     3.6 h
INFO  16:21:33,939 TraversalEngine -      8:99615949        3.18e+07    3.3 h        6.3 m     48.1%         6.9 h     3.6 h
INFO  16:22:04,047 TraversalEngine -     8:110498944        3.19e+07    3.3 h        6.2 m     48.5%         6.9 h     3.5 h
INFO  16:22:24,484 TraversalEngine - Total runtime 11978.49 secs, 199.64 min, 3.33 hours
INFO  16:22:24,509 TraversalEngine - 0 reads were filtered out during traversal out of 32137673 total (0.00%)


Best regards, Maria

Created 2012-09-17 17:42:03 | Updated 2012-09-17 18:02:25

After I ran "IndelRealigner" tool, I saw the following message in the end of the run log, is it normal that 0 reads were filtered out during this step?

-------
INFO  07:01:50,692 TraversalEngine - 0 reads were filtered out during traversal out of 1529770054 total (0.00%)
-------


JH

Created 2012-09-09 02:52:32 | Updated 2012-09-09 02:53:52

My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as --knownSites or --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF. It'd be great to be able to do this on-the-fly in conjunction with --known in tools that require known sites, rather than having to create project-specific (or even tool-specific) VCFs.

Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?

Created 2012-08-03 05:30:36 | Updated 2012-08-03 05:32:30

Hi all - I'm using GATK realigner which can take several hours on my samples. I'm trying to optimize my pipeline by dividing this up by chromosome for each node in my cluster. I can call RealignerTargetCreator using the -L parameter for each chromosome which results in a bunch of interval files. Now, I either want to call IndelRealigner using the -L parameter for each chromosome then merge the resulting BAM files, or merge the interval files into one then call IndelRealigner.

1) I don't see a way to merge interval files using GATK. Is this possible?

or

2) Can I call IndelRealigner and process each chromosome separately then merge the resulting BAM files together?
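
For option 1, one possible approach outside of GATK is to concatenate the per-chromosome interval lists in reference order. The Python sketch below shows the idea on in-memory lines. It assumes each per-chromosome list is already coordinate-sorted (it came from a single -L run) and that the combined list should follow the contig order of the reference; the function name and example intervals are hypothetical, and whether IndelRealigner accepts the result should be verified on a small test.

```python
from typing import Dict, Iterable, List


def merge_interval_lists(per_contig: Dict[str, Iterable[str]],
                         contig_order: List[str]) -> List[str]:
    """Concatenate per-contig interval lines following the reference contig order."""
    merged = []
    for contig in contig_order:
        # Contigs with no targets are simply skipped.
        for line in per_contig.get(contig, []):
            line = line.strip()
            if line:
                merged.append(line)
    return merged


if __name__ == "__main__":
    # Made-up intervals, keyed by the chromosome passed to -L.
    per_chromosome = {
        "chr2": ["chr2:50-60\n"],
        "chr1": ["chr1:100-120\n", "chr1:400-410\n", "\n"],
    }
    for interval in merge_interval_lists(per_chromosome, ["chr1", "chr2", "chr3"]):
        print(interval)
```

In practice the lines would be read from the per-chromosome files and the contig order taken from the reference's sequence dictionary.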

Created 2012-07-31 10:02:07 | Updated 2012-07-31 10:02:07

Dear GATK Team,

I have recently downloaded the GATK Bundle to get the human reference genome and its associated annotations.

After the mapping step on my lane BAM files, I am planning on using IndelRealigner and BaseRecalibrator as it is explained in the "Best Practices v4".

I am always confused about which annotation file I should use for my analysis.

For the indel realignment, in the command line arguments of RealignerTargetCreator, one has to set the '--known' switch to indicate known indel sites.

--known:indels,vcf Mills_and_1000G_gold_standard.indels.b37.sites.vcf --known:dbsnp,vcf dbsnp_135.b37.vcf

But in the annotations folder, you can also find 'dbsnp_135.b37.excluding_sites_after_129.vcf' for dbSNP (the version before 1000 Genomes). Depending on which one I use, the target interval files are pretty different. So I am really wondering which one should be used in my case. Or is there any other factor that could guide me to the better choice?

I have a similar dilemma with base recalibration: "dbsnp_135.b37.vcf" or "dbsnp_135.b37.excluding_sites_after_129.vcf" in the '-knownSites' switch?

Thanks a lot, Best,

Anthony

Created 2012-07-26 05:07:14 | Updated 2012-10-19 16:51:54

Dear GATK team,

Thanks a lot for the new GATK version and GATK forum!

I am trying to use GATK for yeast strains. I do not have files of known sites of SNPs/indels. I understand that the BaseRecalibrator must get such a file. Do you suggest to skip calibration and realignment, or is there another way to go here?

Created 2012-07-24 14:17:40 | Updated 2012-07-24 14:19:56

Dear all,

I was browsing through some of the less used functions in the GATK documentation, hence the following question: Does the LeftAlignIndels function do something additional that is not happening with IndelRealigner? In other words, do you recommend to run LeftAlignIndels on top of the indel realignment?

Best regards, Sophia