For a complete, detailed argument reference, refer to the GATK document page here.
While we advocate running the Indel Realigner over an aggregated BAM using the full Smith-Waterman alignment algorithm, it also works on a single lane of sequencing data when run in knowns-only mode (--consensusDeterminationModel KNOWNS_ONLY). Novel sites won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. The known-only/lane-level realignment strategy suits large-scale projects (e.g. 1000 Genomes) where computation time is severely constrained. We modify the example arguments from above to reflect the command lines necessary for known-only/lane-level cleaning.
The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.
java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf
The IndelRealigner step needs to be run on every bam file.
java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I <lane-level.bam> \
  -R <ref.fasta> \
  -T IndelRealigner \
  -targetIntervals <intervalListFromStep1Above.intervals> \
  -o <realignedBam.bam> \
  -known /path/to/indel_calls.vcf \
  --consensusDeterminationModel KNOWNS_ONLY \
  -LOD 0.4
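Since the IndelRealigner step has to be repeated for every lane-level BAM while the intervals file is reused, the per-file commands can be generated in a loop. The following is a minimal dry-run sketch, not an official wrapper: it only prints the commands, the function name and the ".realigned.bam" output naming are hypothetical, and all paths are placeholders.

```shell
# Print (do not run) one known-only IndelRealigner command per lane-level BAM.
# Arguments: reference, intervals file, known-indels VCF, then one or more BAMs.
# The jar path and the output naming scheme are placeholders (assumptions).
print_lane_realign_cmds() {
    ref=$1
    intervals=$2
    known=$3
    shift 3
    for bam in "$@"; do
        realigned="${bam%.bam}.realigned.bam"
        printf '%s\n' "java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar -T IndelRealigner -R $ref -I $bam -targetIntervals $intervals -known $known --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4 -o $realigned"
    done
}

# Example: print_lane_realign_cmds ref.fasta output.intervals indel_calls.vcf lane1.bam lane2.bam
```

Printing the commands first makes it easy to review them or hand them to a scheduler before anything is executed.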
After I ran the IndelRealigner tool, I saw the following message at the end of the run log. Is it normal that 0 reads were filtered out during this step?
------- INFO 07:01:50,692 TraversalEngine - 0 reads were filtered out during traversal out of 1529770054 total (0.00%) -------
Does anyone know of any known issues with the IndelRealigner? The GATK is calling thousands of SNPs on one genome I have due to bad realignments. It appears the target finder identifies regions of the genome that are essentially perfectly aligned, but when the realigner gets to these areas it emits the reads with a new alignment that is a train wreck compared to what the realigner started with. I am going to investigate further, but thought I would check whether this rings any bells. I am using the flag -model USE_READS and have no known indels to work with.
Hi, I have 32 exomes in a merged BAM file, with read groups identifying each exome; ID, sample, library and platform are all set (I hope) correctly. What I can't understand is why the logs show a discrepancy in the number of reads traversed by RealignerTargetCreator vs IndelRealigner:
From my RealignerTargetCreator run I got:
INFO 18:24:17,286 ProgressMeter - Total runtime 5394.91 secs, 89.92 min, 1.50 hours
INFO 18:24:17,286 MicroScheduler - 389,565,880 reads were filtered out during traversal out of 2,752,553,629 total (14.15%)
INFO 18:24:17,287 MicroScheduler - -> 24573133 reads (0.89% of total) failing BadMateFilter
INFO 18:24:17,287 MicroScheduler - -> 229871090 reads (8.35% of total) failing DuplicateReadFilter
INFO 18:24:17,287 MicroScheduler - -> 135121273 reads (4.91% of total) failing MappingQualityZeroFilter
INFO 18:24:17,298 MicroScheduler - -> 384 reads (0.00% of total) failing UnmappedReadFilter
Vs this from IndelRealigner
INFO 11:47:21,707 MicroScheduler - 0 reads were filtered out during traversal out of 200,100 total (0.00%)
Why have so few of the reads that RealignerTargetCreator reported been traversed by IndelRealigner?
My command for IndelRealigner was:
GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner --maxReadsInMemory 1000000 --maxReadsForRealignment 1000000 -known /data/GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf -known /data/GATK_bundle/hg19/1000G_phase1.indels.hg19.vcf -I Merged_dedup.bam -R /data/GATK_bundle/hg19/ucsc.hg19.fasta -targetIntervals forIndelRealigner.intervals -o Merged_dedup_realigned.bam
Hi, Some aligners produce Smith-Waterman alignments and may soft clip bases from a read when there are indels or mismatches near the ends of the reads. I was wondering if you include these bases in the realignment process? And if not whether you might consider making it an option? Thanks, Colin
Dear GATK team,
Thanks a lot for the new GATK version and GATK forum!
I am trying to use GATK for yeast strains. I do not have files of known SNP/indel sites. I understand that the BaseRecalibrator requires such a file. Do you suggest skipping recalibration and realignment, or is there another way to go here?
I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:
I have been running GATK v1.6.2 on several samples. It seems that I initially ran the indel realignment and quality recalibration steps in reverse order. For example, in order of processing, I ran:
What are the consequences for SNP and indel calling if the GATK steps are run as above? I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at base quality scores when making calls.
Any help on this would be appreciated.
I am trying to decide between two approaches for performing realignment around indels. I have ~600 samples that have been aligned to a very fragmented draft genome assembly.
What is best:
1. take each sample and create a list of targets, followed by realignment on each sample.
2. combine all samples into one large bam file and create a list of targets, followed by realignment on the same large bam file.
Also, would there be any advantages in terms of speed with either approach?
My current workflow for analysing mouse exome-sequencing data (based on v4 of Best Practices) can require me to use slightly different VCFs as --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF. It'd be great to be able to do this on-the-fly in conjunction with --known in tools that require knownSites, rather than having to create project-specific (or even tool-specific) VCFs.
Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?
When doing an indel realignment with GATK, the 'MD' field in the SAM/BAM record gets dropped for realigned reads. Is it possible to recompute them directly with GATK? I know that 'samtools calmd' does this, but are there alternative options?
I am using the following set of commands on GATK2.1.13 to generate a VCF file
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator -o my.intervals -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment Done at `date`"
echo "Starting IndelRealigner at `date`"
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment done at `date`"
echo "Starting UnifiedGenotyper at `date`"
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper -I myrealignedBam.bam -o mygatk_vcf.vcf --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Gentoypxing complete at `date`"
When I run 'mpileup' on B2_with_ReadGroup.ddup.sorted.bam, I get a decent 10 MB VCF file. But at the last step of the above pipeline, my "mygatk_vcf.vcf" grows to 81 GB!
Do you know what is wrong?
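One likely contributor to the size difference is the --output_mode EMIT_ALL_SITES argument in the UnifiedGenotyper command, which requests a record for every callable site instead of variant sites only. A quick way to check this on an existing file is to count records whose ALT column is "." against records that carry an alternate allele. The sketch below assumes a plain-text, tab-delimited VCF; the function name is made up for illustration.

```shell
# Count VCF records with no alternate allele (ALT == ".") versus records
# that carry one. Header lines (starting with "#") are skipped.
count_vcf_sites() {
    awk -F '\t' '
        /^#/      { next }
        $5 == "." { ref++; next }
                  { var++ }
        END       { printf "reference_only=%d variant=%d\n", ref, var }
    ' "$1"
}

# Example: count_vcf_sites mygatk_vcf.vcf
```

If nearly all records are reference-only, the file size is explained by the output mode rather than by a problem in the realignment steps.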
I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.
Any help or guidance would be greatly appreciated!
Command Used for HaplotypeCaller:
java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG
Commands Used for pre-processing (run in sequence using a Perl script):
java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf
java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam
samtools index exome.realigned.dedup
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate
java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam
Hi all - I'm using the GATK realigner, which can take several hours on my samples. I'm trying to optimize my pipeline by dividing this up by chromosome for each node in my cluster. I can call RealignerTargetCreator using the -L parameter for each chromosome, which results in a bunch of interval files. Now, I either want to call IndelRealigner using the -L parameter for each chromosome and then merge the resulting BAM files, or merge the interval files into one and then call IndelRealigner.
1) I don't see a way to merge interval files using GATK. Is this possible?
2) Can I call IndelRealigner and process each chromosome separately then merge the resulting BAM files together?
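For question 1, one option outside GATK is plain concatenation, assuming each per-chromosome file holds one interval per line (e.g. chr1:100-200) and the files are listed in the same contig order as the reference. This is a sketch under those assumptions, with a hypothetical function name, not a GATK feature.

```shell
# Concatenate per-chromosome interval files into one file.
# Usage: merge_intervals OUTPUT FILE1 [FILE2 ...]
# Assumes the caller lists the input files in reference (contig) order;
# blank lines are dropped.
merge_intervals() {
    merged=$1
    shift
    cat "$@" | grep -v '^[[:space:]]*$' > "$merged"
}

# Example: merge_intervals all.intervals chr1.intervals chr2.intervals chr3.intervals
```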
Dear GATK Team,
I have recently downloaded the GATK Bundle to get the human reference genome and its associated annotations.
After the mapping step on my lane BAM files, I am planning on using IndelRealigner and BaseRecalibrator as it is explained in the "Best Practices v4".
I am always confused about which annotation file I should use for my analysis.
For the indel realignment, in the command-line arguments of RealignerTargetCreator, one has to set the '--known' switch to indicate known indel sites.
--known:indels,vcf Mills_and_1000G_gold_standard.indels.b37.sites.vcf --known:dbsnp,vcf dbsnp_135.b37.vcf
But in the annotations folder, you can also find 'dbsnp_135.b37.excluding_sites_after_129.vcf' for dbSNP (the version before 1000 Genomes). Depending on which one I use, the target interval files are quite different. So I am wondering which one should be used in my case. Or is there any other factor that could guide me to the better choice?
I have a similar dilemma with base recalibration: "dbsnp_135.b37.vcf" or "dbsnp_135.b37.excluding_sites_after_129.vcf" in the '-knownSites' switch?
Thanks a lot, Best,
Hi, I am calling indels in pooled samples using this command:
java -jar -Xmx2g /PATH/2.1.13/GenomeAnalysisTK.jar -l INFO -T UnifiedGenotyper -I pool1.bam -I pool2.bam --out INDEL.vcf -R /reference.fa -glm INDEL
Currently I do not have any information on already known indels.
1. Do I need to first realign (RealignerTargetCreator and IndelRealigner) and then call indels, even for pooled data?
2. How different will this be from calling indels on an individual sample?
Looking forward to your suggestions. With thanks, sasha
Hi, I am doing whole-genome sequencing on 2 samples, each of which was run on 12 lanes of a SOLiD 5500 machine. These are at fairly high coverage of ~40x each. My plan was to align each lane independently, then merge all 12 lanes for each sample into 1 large BAM file, then do the post-processing. I did this and was able to do the indel realignment on both samples, but I have been having trouble with the base recalibration step: when applying the recalibration, PrintReads crashes with an error saying that there is not enough memory available. I have tried changing the tmp directory that is used and every other trick that I have been able to find on the forum.
I was wondering if an alternate and suitable approach would be to perform all of the post-processing on each individual lane first, then merge all of the lanes together after that. Would doing that have any adverse effect on the downstream analysis, i.e. SNPs, CNVs, translocations, etc.?
I can't seem to run the IndelRealigner on reads that contain colons, ":" in the reference scaffold names. The RealignerTargetCreator step works correctly and generates the interval table, but the second, IndelRealigner, step fails. When I look at the generated interval table, I see the interval delimiter is a colon, which I imagine is the problem.
Unfortunately, I have a set of human references that have a colon in every scaffold name, so changing this would be a massive undertaking.
I believe this problem could be solved if you searched for the colon delimiter from the end of the interval string instead of from the beginning, so I'm hoping this is a really simple fix.
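The idea of searching from the end can be illustrated with shell parameter expansion, which can strip the shortest match from the right. This is only a sketch of the proposed parsing rule, not GATK's actual interval parser, and the function name is hypothetical.

```shell
# Split "contig:start-stop" at the LAST colon so that contig names may
# themselves contain colons. Prints: contig<TAB>start<TAB>stop
# A single-position interval ("contig:pos") yields start == stop.
split_interval() {
    interval=$1
    contig=${interval%:*}    # everything before the last colon
    range=${interval##*:}    # everything after the last colon
    start=${range%-*}
    stop=${range#*-}
    printf '%s\t%s\t%s\n' "$contig" "$start" "$stop"
}

split_interval 'scaffold:A:12:1000-2000'
```

Note that the rule stays ambiguous for a bare contig name that contains a colon but has no coordinates, so a real parser would also need to check candidate names against the sequence dictionary.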
I'm using IndelRealigner to do a local realignment in the standard GATK workflow. I have used this pipeline before with success, but am now met with this error. I could not find any other examples of this, so I am posting as per the instructions in the error.
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException
  at org.broadinstitute.sting.utils.sam.AlignmentUtils.createIndelString(AlignmentUtils.java:710)
  at org.broadinstitute.sting.utils.sam.AlignmentUtils.leftAlignIndel(AlignmentUtils.java:603)
  at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.determineReadsThatNeedCleaning(IndelRealigner.java:912)
  at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.clean(IndelRealigner.java:681)
  at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.cleanAndCallMap(IndelRealigner.java:547)
  at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:519)
  at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:114)
  at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:104)
  at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52)
  at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71)
  at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
  at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
  at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
  at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
  at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.1-11-g13c0244):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------
Hello, I am a first-time user of GATK and have spent some time now trying to get the input BAM files into the appropriate format. To run IndelRealigner, I have added read groups, reordered and indexed my BAM file with the respective Picard tools.
My command-line is the following:
java -Djava.io.tmpdir='pwd'/tmp -jar GenomeAnalysisTK.jar -I ./add_read_groups_reorder_index.bam -R ./genome.fa -T IndelRealigner -targetIntervals ./gatk.intervals -o ./*.bam -known ./Mills-1000G-indels.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4
I get the following message:
SAM/BAM file /home/gp53/tophat2-merge-ctl-1st-2nd-readgroups-reorder-index.bam is malformed: SAM file doesn't have any read groups defined in the header.
My reads are paired-end, aligned with TopHat2. I would appreciate your help on this. Thanks, G.
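Since the error refers to the header, a first check is whether the BAM that GATK actually opened contains any @RG header lines (note that the path in the error message differs from the one on the command line). The sketch below reads a SAM header from standard input; the function name is hypothetical, and feeding it from a BAM assumes samtools is installed.

```shell
# Succeed (exit status 0) only if the SAM header on stdin has at least
# one @RG line.
has_read_groups() {
    grep -q '^@RG'
}

# Typical use, assuming samtools is available:
#   samtools view -H input.bam | has_read_groups && echo "read groups present"
```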
I got this when I ran the IndelRealigner. The output bam is empty.
INFO 15:19:35,568 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours
INFO 15:19:36,910 GATKRunReport - Uploaded run statistics report to AWS S3
It didn't initialize. The sample is aligned to a specific region of the genome and I did use the -L option. For a whole-genome alignment of the same sample, I don't have any problems. Do you know why?
Hi, For both IndelRealigner/RealignerTargetCreator, there is an option for known indel sites as below:
However, from the bundle files collection such as from hg19, there are several vcf files:
1000G_indels_for_realignment.hg19.vcf
1000G_omni2.5.hg19.sites.vcf
1000G_omni2.5.hg19.vcf
dbsnp_132.hg19.excluding_sites_after_129.vcf
dbsnp_132.hg19.vcf
hapmap_3.3.hg19.sites.vcf
hapmap_3.3.hg19.vcf
indels_mills_devine.hg19.sites.vcf
indels_mills_devine.hg19.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf
Amongst them, just based on the names, 1000G_indels_for_realignment.hg19.vcf and indels_mills_devine.hg19.sites.vcf look like the files to use for IndelRealigner/RealignerTargetCreator. Could you clarify the exact files for this purpose?
With the old version, I used 1000G_phase1.indels.hg19.vcf and Mills_and_1000G_gold_standard.indels.hg19.sites.vcf. I compared the new and old files, and they are quite different now.
I was browsing through some of the less used functions in the GATK documentation, hence the following question: Does the LeftAlignIndels function do something additional that is not happening with IndelRealigner? In other words, do you recommend to run LeftAlignIndels on top of the indel realignment?
Best regards, Sophia
I downloaded the newest version of GATK (version 2.4-3) this week and tried to perform local realignment on my targeted sequencing data. The reference genome, SNP and indel data files were downloaded from the resource bundle. However, I encountered two issues during realignment.
First, in the RealignerTargetCreator step: with the same command line, version 2.4-3 gives the error message "MESSAGE: -49" (no further details provided), whereas the older version 2.3-9 runs with no errors.
Second, in the IndelRealigner step: I got the error message "MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '13'". However, the reference genome was downloaded from the bundle. I am not sure how to fix this issue.
I hope someone can help me with these issues. Let me know if more info is needed.
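One way to investigate the second error is to scan the reference FASTA for characters outside the IUPAC nucleotide codes. If the quoted '13' is the numeric value of the offending byte (an assumption, not confirmed by the message itself), it would correspond to a carriage return, i.e. Windows-style line endings. The sketch below prints the line numbers of suspicious sequence lines; the function name is hypothetical.

```shell
# Print the line numbers of FASTA sequence lines that contain any character
# other than an IUPAC nucleotide code (upper or lower case).
# Header lines (starting with ">") are ignored. A trailing carriage return
# counts as a non-IUPAC character.
find_non_iupac_lines() {
    awk '
        /^>/ { next }
        /[^ACGTURYSWKMBDHVNacgturyswkmbdhvn]/ { print NR }
    ' "$1"
}

# Example: find_non_iupac_lines reference.fasta | head
```

An empty result suggests the problem lies elsewhere, for example in a stale index or dictionary file rather than in the FASTA text itself.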
I've followed the suggested protocol for local realignment - first using RealignerTargetCreator and then IndelRealigner, but have unexpected results.
Let's call the two BAMs I'm realigning "normal" and "tumour" or N and T for short. Once realigned, I've split the resulting NT BAM file (using readgroup tags, although I see from the docs that it can create separate files natively) back into the original N and T BAM files and discovered something odd. I was expecting the pre-realignment N and T files to contain the same number of reads as the post-realignment files, only the coordinates that reads are mapped to would be different.
However, I notice that post-realignment files contain significantly fewer reads because unaligned reads and reads not aligned to the autosomes or sex chromosomes have been removed. However, these reads alone do not account for the difference; large numbers of reads aligned to the 24 chromosomes are now missing.
Can you tell me more about the reads that are removed? I suspect it to be an alignment quality issue, but cannot find direct reference to this behaviour in the documentation. I'm currently keeping both my pre and post-realignment bam files, but ultimately there will be space constraints and I'll have to choose and would like to make the most informed decision possible.
Hi, I got errors when I ran GATK RealignerTargetCreator and IndelRealigner in v2.4.9. I've checked many related discussions and comments. First, I got an error like "we encountered an extremely high quality score of 69" with option -S LENIENT, and the GATK program stalled. So I added "--fix_misencoded_quality_scores", and now I get a different error message: "ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '0'". I tried older versions of GATK and both Java 1.6 and 1.7. I'm hoping that you can help with this. Please let me know if I'm missing something. Thanks!
I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired end reads, my protocol until now is (in brief, omitting options etc.)
bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads
A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.
I have 85767226 read pairs = 171534452 sequences in the FASTQ files.
BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.
Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.
so nearly 6 million reads seem to be missing.
CreateTargets MicroScheduler reports
35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter
so nearly 5 million reads seem to be missing.
The Realigner MicroScheduler reports
0 reads were filtered out during traversal out of 171551640 total (0.00%)
which seems like a miracle to me, since 1) there are now even more reads than input sequences, and 2) all those poor-quality reads reported by CreateTargets do not appear.
From Base recalibration MicroScheduler, I get
41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter
..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.
I found these varying counts a little confusing. Can someone please give me a hint on the logic behind these numbers? And does the protocol look sensible?
Thanks for any comments!
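To compare the counts across steps, the traversal summary lines can be pulled out of each log and put side by side. The sketch below assumes the summary wording shown in the logs above ("N reads were filtered out during traversal out of M total") and recomputes the percentage; the function name is hypothetical.

```shell
# Read GATK log text on stdin and print "filtered total percent" for every
# traversal summary line. Thousands separators in the counts are removed.
filter_summary() {
    awk '
    {
        f = ""; t = ""
        for (i = 2; i <= NF; i++) {
            if ($i == "reads" && $(i + 1) == "were" && $(i + 2) == "filtered") f = $(i - 1)
            if ($i == "total") t = $(i - 1)
        }
        if (f != "" && t != "") {
            gsub(/,/, "", f); gsub(/,/, "", t)
            pct = (t + 0 > 0) ? 100 * f / t : 0
            printf "%s %s %.2f\n", f, t, pct
        }
    }'
}

# Example: filter_summary < realigner_target_creator.log
```

Using the numbers quoted in the post, the gaps are 171534452 - 165619516 = 5914936 reads after MarkDuplicates and 171534452 - 166579875 = 4954577 reads at the target-creation step, which matches the "nearly 6 million" and "nearly 5 million" estimates.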
I have bisulfite-treated sequence mapped using Bismark and Bowtie2 and I'd like to call SNPs and indels from it. I have used Bis-SNP to call SNPs but it doesn't call indels. Can I use GATK to call indels from the mapped data? Do you have any support for bisulfite data? Another question, please: the data is a mix from 6 different people. Do you have any support for pooled data? Thanks for your help.
Hi. I am getting VERY odd results with some Streptococcus equi sequence. The BAM files from BWA align well in IGV, but when I run them through your pipeline there are many local errors where it seems that a single indel has been incorrectly multiplied up somehow. You need to see the IGV screenshot!
The bottom is a BAM file from BWA and the top is the final one from the GATK pipeline.
Hi, I have used IndelRealigner successfully in the past but I am now getting the following error. Please let me know if you have any suggestions.
[Fri Feb 15 09:09:37 GMT 2013] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/exports/home/fturner/vet_roslin_ark_genomics/reference_genomes/chicken/Gallus_gallus.WASHUC2.69.dna.toplevel.fa OUTPUT=/exports/home/fturner/vet_roslin_ark_genomics/reference_genomes/chicken/dict1298654176074562895.tmp TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 TMP_DIR=/tmp/fturner VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Feb 15 09:09:37 GMT 2013] Executing as fturner@eddie327 on Linux 2.6.32-220.23.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_07-b10
[Fri Feb 15 09:09:52 GMT 2013] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.24 minutes. Runtime.totalMemory()=442957824
java.lang.UnsupportedOperationException: Cannot enable index memory mapping for a SAM text reader
I am using one of the 1000 Genomes exome datasets (*.bam format), sample NA12892, aligned to the GRCh37 build. I just started to use the IndelRealigner tool, which requires a proper *.interval_list file to work.
However, I am unable to find or generate an *.interval_list file compatible with my data. Where can I download or generate an *.interval_list (or *.bed) file that is compatible with exome data?
Normally, these files are provided by the producers of library prep kits (e.g. Illumina). But I couldn't find which interval file should be used with 1000 Genomes data.
Apologies if this has been reported but I can't find it in the forum.
We're in the process of upgrading to GATK v2 but have been using v1.5, and have just noticed a few cases where IndelRealigner suddenly ended without warning or any report of an error. See the example below, where it ended with only ~50% of the BAM file processed. I'm wondering if it's a memory issue when multiple samples are run concurrently. More importantly, with no alert it is tricky for us to identify when this happens. Is this something that has been fixed in later versions, e.g. GATK 2.1? That is, will IndelRealigner report an error when it finishes but the sample has not been processed to completion?
INFO 16:21:03,120 TraversalEngine - 8:90782004 3.17e+07 3.3 h 6.3 m 47.8% 6.9 h 3.6 h
INFO 16:21:33,939 TraversalEngine - 8:99615949 3.18e+07 3.3 h 6.3 m 48.1% 6.9 h 3.5 h
INFO 16:22:04,047 TraversalEngine - 8:110498944 3.19e+07 3.3 h 6.2 m 48.5% 6.9 h 3.5 h
INFO 16:22:24,484 TraversalEngine - Total runtime 11978.49 secs, 199.64 min, 3.33 hours
INFO 16:22:24,509 TraversalEngine - 0 reads were filtered out during traversal out of 32137673 total (0.00%)
Thank you in advance.
Best regards, Maria
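Until the tool itself flags such cases, one workaround is to inspect the last progress percentage in each log and treat a run that stopped far below 100% as suspicious. The sketch below assumes progress lines of the form shown above, with a bare field such as "48.5%"; the function name is hypothetical and the threshold to apply is left to the caller.

```shell
# Print the last completion percentage found in GATK progress lines on stdin.
# Only bare fields like "48.5%" are considered, so the "(0.00%)" in the
# final filter summary is ignored. Prints nothing if no such field exists.
last_progress_percent() {
    awk '
        /TraversalEngine|ProgressMeter/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+(\.[0-9]+)?%$/) last = $i
        }
        END { if (last != "") print last }
    '
}

# Example: last_progress_percent < indel_realigner.log
```

Because progress is only logged periodically, a healthy run may also end with a last value somewhat below 100%, so this is a heuristic rather than a definitive completeness check.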
Hi, I've run into what appears to be a bug in handling output in IndelRealigner. When specifying --nWayOut everything works, but when I add --disable_bam_indexing, it appears to be expecting --out instead?
##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
....
##### ERROR MESSAGE: Value for argument with name '--out' (-o) is missing.