Detailed information about command line options for BaseRecalibrator can be found here.
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.
This process is accomplished by analyzing the covariation among several features of a base. For example:
These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.
The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.
Detailed information about command line options for BaseRecalibrator can be found here.
This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:
java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam
After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.
This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.
Effectively the new quality score is:
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:
This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).
Example Arguments table:
#:GATKTable:true:1:17::; #:GATKTable:Arguments:Recalibration argument collection values used in this run Argument Value covariate null default_platform null deletions_context_size 6 force_platform null insertions_context_size 6 ...
The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.
The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.
Example Arguments table:
#:GATKTable:true:2:94:::; #:GATKTable:Quantized:Quality quantization map QualityScore Count QuantizedScore 0 252 0 1 15972 1 2 553525 2 3 2190142 9 4 5369681 9 9 83645762 9 ...
This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:; #:GATKTable:RecalTable0: ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors SRR032768 D 40.7476 45.0000 2642683174 222475 SRR032766 D 40.9072 45.0000 2630282426 213441 SRR032764 D 40.5931 45.0000 2919572148 254687 SRR032769 D 40.7448 45.0000 2850110574 240094 SRR032767 D 40.6820 45.0000 2820040026 241020 SRR032765 D 40.9034 45.0000 2441035052 198258 SRR032766 M 23.2573 23.7733 2630282426 12424434 SRR032768 M 23.0281 23.5366 2642683174 13159514 SRR032769 M 23.2608 23.6920 2850110574 13451898 SRR032764 M 23.2302 23.6039 2919572148 13877177 SRR032765 M 23.0271 23.5527 2441035052 12158144 SRR032767 M 23.1195 23.5852 2820040026 13750197 SRR032766 I 41.7198 45.0000 2630282426 177017 SRR032768 I 41.5682 45.0000 2642683174 184172 SRR032769 I 41.5828 45.0000 2850110574 197959 SRR032764 I 41.2958 45.0000 2919572148 216637 SRR032765 I 41.5546 45.0000 2441035052 170651 SRR032767 I 41.5192 45.0000 2820040026 198762
This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable1: ReadGroup QualityScore EventType EmpiricalQuality Observations Errors SRR032767 49 M 33.7794 9549 3 SRR032769 49 M 36.9975 5008 0 SRR032764 49 M 39.2490 8411 0 SRR032766 18 M 17.7397 16330200 274803 SRR032768 18 M 17.7922 17707920 294405 SRR032764 45 I 41.2958 2919572148 216637 SRR032765 6 M 6.0600 3401801 842765 SRR032769 45 I 41.5828 2850110574 197959 SRR032764 6 M 6.0751 4220451 1041946 SRR032767 45 I 41.5192 2820040026 198762 SRR032769 6 M 6.3481 5045533 1169748 SRR032768 16 M 15.7681 12427549 329283 SRR032766 16 M 15.8173 11799056 309110 SRR032764 16 M 15.9033 13017244 334343 SRR032769 16 M 15.8042 13817386 363078 ...
This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable2: ReadGroup QualityScore CovariateValue CovariateName EventType EmpiricalQuality Observations Errors SRR032767 16 TACGGA Context M 14.2139 817 30 SRR032766 16 AACGGA Context M 14.9938 1420 44 SRR032765 16 TACGGA Context M 15.5145 711 19 SRR032768 16 AACGGA Context M 15.0133 1585 49 SRR032764 16 TACGGA Context M 14.5393 710 24 SRR032766 16 GACGGA Context M 17.9746 1379 21 SRR032768 45 CACCTC Context I 40.7907 575849 47 SRR032764 45 TACCTC Context I 43.8286 507088 20 SRR032769 45 TACGGC Context D 38.7536 37525 4 SRR032768 45 GACCTC Context I 46.0724 445275 10 SRR032766 45 CACCTC Context I 41.0696 575664 44 SRR032769 45 TACCTC Context I 43.4821 490491 21 SRR032766 45 CACGGC Context D 45.1471 65424 1 SRR032768 45 GACGGC Context D 45.3980 34657 0 SRR032767 45 TACGGC Context D 42.7663 37814 1 SRR032767 16 AACGGA Context M 15.9371 1647 41 SRR032764 16 GACGGA Context M 18.2642 1273 18 SRR032769 16 CACGGA Context M 13.0801 1442 70 SRR032765 16 GACGGA Context M 15.9934 1271 31 ...
The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.
If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.
All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.
I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?
Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.
The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.
However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:
For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.
This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!
GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Hi,
I did search on the site and seems several have had a similar problem but I couldn't fix mine based on those. I've managed to get to this point succesfully with creating a realigned bam file with IndelRealigner. I'm using a non-model organism with self created masking list from a VCF run done before the realignment (list in bed format). I've had to use the -fixMisencodedQuals so far through out my runs. I checked the masking file so that the ranges do not go over the ranges that are specified in my list of regions to cover (-L option).
Is there anything else I would need to check still?
Command line view:
sulyba@hippu4:/fs/lustre/wrk/sulyba/stickleback_capture> java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals
INFO 12:34:10,502 HelpFormatter - --------------------------------------------------------------------------------
INFO 12:34:10,510 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14
INFO 12:34:10,510 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 12:34:10,510 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 12:34:10,516 HelpFormatter - Program Args: -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals
INFO 12:34:10,516 HelpFormatter - Date/Time: 2013/02/01 12:34:10
INFO 12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO 12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO 12:34:10,551 ArgumentTypeDescriptor - Dynamically determined type of Sites_to_Mask.bed to be BED
INFO 12:34:10,563 GenomeAnalysisEngine - Strictness is SILENT
INFO 12:34:11,309 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 12:34:11,319 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 12:34:11,392 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
INFO 12:34:11,416 RMDTrackBuilder - Loading Tribble index from disk for file Sites_to_Mask.bed
INFO 12:34:11,660 GenomeAnalysisEngine - Processing 20350670 bp from intervals
INFO 12:34:11,672 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 12:34:11,674 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 12:34:11,946 BaseRecalibrator - The covariates being used here:
INFO 12:34:11,947 BaseRecalibrator - ReadGroupCovariate
INFO 12:34:11,947 BaseRecalibrator - QualityScoreCovariate
INFO 12:34:11,947 BaseRecalibrator - ContextCovariate
INFO 12:34:11,947 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 12:34:11,947 BaseRecalibrator - CycleCovariate
INFO 12:34:11,952 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO 12:34:11,952 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:34:11,952 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO 12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO 12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1002, 3]
INFO 12:34:11,954 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:34:11,954 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:34:12,109 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 12:34:12,120 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO 12:34:12,644 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 12:34:12,645 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO 12:34:14,901 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -2
at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158)
at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:246)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530)
at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: -2
##### ERROR ------------------------------------------------------------------------------------------
Dear GATK Team,
I am running a pipeline on several high coverage human individuals that have been mapped using bwa and processed using samtools, picard and gatk. The bam-files pass ValidateSam from picard, but when I run the bqsr step some of them fails giving a Malformed read error (using -filterMBQ does not help in this case). I tracked down the error to bamfiles that ends with a paired end read where the mate maps in the beginning of the contig (in my case human mtDNA).
Eg, this will make it crash:
readX 177 MT 16558 37 7S2M2I10M80S = 294 -16176 GACCTGTGATCC...
readY 177 MT 16558 37 7S2M2I10M80S = 238 -16232 GACCTGTGATCC...
readZ 113 MT 16558 37 7S2M2I10M80S = 273 -16197 GACCTGTGATCC...
[END]
where a file ending like this wont crash:
readX 83 MT 16469 60 101M = 16246 -324 TGGGGGTAGCTAAAGTGAAC...
readY 147 MT 16469 60 101M = 16267 -303 TGGGGGTAGCTAAAGTGA...
readZ 147 MT 16469 60 101M = 16193 -377 TGGGGGTAGCTAAAGTGAAC...
[END]
I am running GATK v2.3-9-ge5ebf34, but the same error occurs using GATK v-2.2-3 (my previous version). I can genotype the files using UnifiedGenotyper without any problem as well.
This is the error:
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read? at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:380) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
Cheers,
Simon
Sorry to post such a simple question but I seem to be at my wits end. Base Recalibrator keeps giving me this error:
Why isn't HiSeq2000 recognized as Illumina?
Hi I got the following error with GenomeAnalysisTK-2.2-2-gf44cc4e's Base Recalibrator.
##### ERROR MESSAGE: Key 2006 is too large for dimension 2 (max is 2001)
I also ran the picard's validateSamFile to validate my BAM file and it says NO ERRORs. What exactly does this error mean? what key is it talking about? And how can I fix it? Thanks, Ashu
Dear GATK team,
Thanks a lot for the new GATK version and GATK forum!
I am trying to use GATK for yeast strains. I do not have files of known sites of SNPs/indels. I understand that the BaseRecalibrator must get such a file. Do you suggest to skip calibration and realignment, or is there another way to go here?
When running GATK, I am getting "empty" results when running BaseRecalibrator. I didn't see a solution to this when searching.
java -Xmx4g -jar /seqprg/GenomeAnalysisTK-2.4-3-g2a7af43/GenomeAnalysisTK.jar -l INFO -R /Users/bcantarel/projects/refdb/human_g1k_v37.fasta --knownSites /Users/bcantarel/projects/refdb/00-All.vcf -I Sample_cDNA405.bam -T BaseRecalibrator -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o Sample_cDNA405.grp
java.lang.IllegalStateException: recalibration tables list is empty at org.broadinstitute.sting.gatk.walkers.bqsr.RecalibrationEngine.mergeThreadLocalRecalibrationTables(RecalibrationEngine.java:209) at org.broadinstitute.sting.gatk.walkers.bqsr.RecalibrationEngine.finalizeData(RecalibrationEngine.java:175) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:508) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:131) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:123) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:283) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
This method is described to be the "First pass of the base quality score recalibration". What is the second pass? It is not mentioned anywhere, or am I looking in the wrong place? In v1.2 there were two steps, is there only one step now for bqsr? Confused, Juan
Hi,
We are working with Illumina HiSeq 2000 paired-end data and as time goes by, lanes yield more and more sequences.
We are processing data at the lane BAM level (only one read group). The procedure, among others, does BWA mapping, Indel realignment, duplicates flagging and base quality recalibration. This is, as expected, a long process to complete but clearly the base recalibration stage is the longest by far, especially when lanes contain many sequences. We are using QualityScoreCovariate, ReadGroupCovariate, ContextCovariate and CycleCovariate covariates.
For instance, we have quite big lanes :
1 lane of 140,000,000 pairs (280,000,000 reads) : ~36 hours for recalibration
1 lane of 185,000,000 pairs (370,000,000 reads) : ~48 hours for recalibration
We obviously wish to reduce this run time and I found in the following link a small chapter on the topic (at the very end of the page) : http://gatk.vanillaforums.com/discussion/44/base-quality-score-recalibrator#latest
So, we are really keen on downsampling our BAM files to reduce run time but at the same time we want our data as accurate as possible to help us for instance in the task of diminishing false positive substitutions rate. So if it is worth to wait, we wait.
Nevertheless, in the plot shown in the previous link, the x axis stops at 5,000,000 reads, where the RMSE value seems to have reached a "plateau".
1) We were thus wondering if there is a read count threshold (empirical value) above which the accuracy of the recalibration is no more improved ?
2) If such a threshold exists, I can not find the '--process_nth_locus' switch described in the link above, should I use '-dt', '-dfrac', '-dcov' options instead to downsample ?
3 ) Is the '--num_threads' working with BaseRecalibrator Walker ? Up to how many threads ?
Thanks a lot,
Best Regards,
Anthony
PS : GATK version used is v2.0-23-ge9a19be
I have used BaseRecalibrator with the -plots option, but no PDF is produced. Is there some other software that is required for this?
Hi, I am getting an error (no info given about causes unfortunately) following running BaseRecalibrator:
java -Xmx4g -jar $tool/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I $bwa/BAM/s_1.rmdup_readgps.bam \
-R $bin/Bos_taurus.UMD3.1.66.fa \
-knownSites $bin/Bos_taurus_UMD_3.1.DBSNP.zero.ordered.bed \
-o $gatk/recal_rea/recal_data1.grp
I get output to screen of all chromosomes, positions etc followed by the error
chrX_dna:chromosome_chromosome:UMD3.1:X:1:148823899:1, chr1_dna:chromosome_chromosome:UMD3.1:1:1:158337067:1 #####ERROR------------------------------
Can you suggest any reasons for BaseRecalibrator giving up here? I understand the BED file is 0-based but it has been used successfully in the previous incarnation of BaseRecalibrator. I have tried the knownSites:mask,BED and it has no effect. I have all necessary readgroup info and index for BAM, and indexed BED.
Your help is much appreciated.
I'm trying to run the BaseRecalibrator tool on my data and am getting the following error:
INFO 14:58:17,399 HelpFormatter - --------------------------------------------------------------------------------- [33/222]
INFO 14:58:17,400 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-13-g1706365, Compiled 2012/10/12 19:21:06
INFO 14:58:17,400 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:58:17,400 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:58:17,401 HelpFormatter - Program Args: -T BaseRecalibrator -I /home/sheenams/gatk_test/LMG-206.GATKinitialrmdup.srt.bam -R /home/genetics/G
enomes/gatk-bundle/human_g1k_v37.fasta -knownSites /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf -knownSites /home/genetics/Genomes/gatk-bundl
e/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -knownSites /home/genetics/Genomes/gatk-bundle/1000G_phase1.indels.b37.vcf -o /home/sheenams/gat
k_test/LMG-206.recal_data.csv -log /home/sheenams/gatk_test/LMG-206.gatk_log
INFO 14:58:17,401 HelpFormatter - Date/Time: 2012/10/17 14:58:17
INFO 14:58:17,401 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:58:17,401 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:58:17,407 ArgumentTypeDescriptor - Dynamically determined type of /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf to be VCF
INFO 14:58:17,409 ArgumentTypeDescriptor - Dynamically determined type of /home/genetics/Genomes/gatk-bundle/Mills_and_1000G_gold_standard.indels.b3
7.sites.vcf to be VCF
INFO 14:58:17,410 ArgumentTypeDescriptor - Dynamically determined type of /home/genetics/Genomes/gatk-bundle/1000G_phase1.indels.b37.vcf to be VCF
INFO 14:58:17,414 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:58:17,463 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 14:58:17,479 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 14:58:17,487 RMDTrackBuilder - Loading Tribble index from disk for file /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf
WARN 14:58:17,574 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUND
ED but standard is A
INFO 14:58:17,575 RMDTrackBuilder - Loading Tribble index from disk for file /home/genetics/Genomes/gatk-bundle/Mills_and_1000G_gold_standard.indels
.b37.sites.vcf
WARN 14:58:17,589 VCFStandardHeaderLines$Standards - Repairing standard header line for field GQ because -- type disagree; header has Float but stan
dard is Integer
INFO 14:58:17,590 RMDTrackBuilder - Loading Tribble index from disk for file /home/genetics/Genomes/gatk-bundle/1000G_phase1.indels.b37.vcf
WARN 14:58:17,603 VCFHeader - Found GL format, but no PL field. As the GATK now only manages PL fields internally automatically adding a correspond
ing PL field to your VCF header
WARN 14:58:17,603 VCFStandardHeaderLines$Standards - Repairing standard header line for field AC because -- count types disagree; header has UNBOUND
ED but standard is A -- descriptions disagree; header has 'Alternate Allele Count' but standard is 'Allele count in genotypes, for each ALT allele, i
n the same order as listed'
WARN 14:58:17,603 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has INTEGER
but standard is A -- descriptions disagree; header has 'Global Allele Frequency based on AC/AN' but standard is 'Allele Frequency, for each ALT alle
le, in the same order as listed'
INFO 14:58:18,093 BaseRecalibrator - The covariates being used here:
INFO 14:58:18,093 BaseRecalibrator - ReadGroupCovariate
INFO 14:58:18,093 BaseRecalibrator - QualityScoreCovariate
INFO 14:58:18,094 BaseRecalibrator - ContextCovariate
INFO 14:58:18,094 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 14:58:18,094 BaseRecalibrator - CycleCovariate
INFO 14:58:18,136 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 14:58:18,137 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 14:58:35,886 GATKRunReport - Uploaded run statistics report to AWS S3
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Key 2002 is too large for dimension 2 (max is 2001) at org.broadinstitute.sting.utils.collections.NestedIntegerArray.put(NestedIntegerArray.java:77) at org.broadinstitute.sting.gatk.walkers.bqsr.AdvancedRecalibrationEngine.updateDataForPileupElement(AdvancedRecalibrationEngine.java:97) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:244) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:106) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
I didn't see any other questions in the forum that addressed this. Can you please guide me on how to fix this error? I'm running GATK 2.1.13.
Thanks,
Sheena
Hello,
I have a BAM from RNA Seq experiment. The reads are aligned with tophat and then I filtered secondary reads, fixed mate pairs, marked duplicates with picard and realigned with gatk. However, when I try to use BaseRecalibrator function,
-T BaseRecalibrator -nct 12 -R ./SusGenome10v69/Sus_scrofa.Sscrofa10.2.69.dna.toplevel.fa -I ./7_15.corrected.ordered.sremoved.marked.fixed.realigned.fixed.corrected.ordered.sremoved.marked.fixed.realigned.fixed.bam -knownSites ./SusGenome10v69/Sus_scrofa.vcf -o ./7_15.RecalData.grp
It stops with the error:
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Trying to clip before the start or after the end of a read at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:534) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesLeftTail(ReadClipper.java:176) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:389) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:392) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:244) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:131) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:230) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:218) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679)
The bam file validates properly with the Picard's ValidateSamFile function. I tried with gatk versions, 2.3-9 and 2.4-7 obtaining the same results.
I used the same procedure with other samples without any problem. However, I am unable to recalibrate this sample. I tried to realign the reads with different tophat versions, applied different filters and different procedures, however i am unable to use the BaseRecalibrator function with this sample.
Could you help me?
Thanks!
Hello,
sorry if i missed the same problem in other threads in the forum... but we are having trouble running BaseRecalibrator in a sample and i couldn't find the solution.
I tried many steps and here is what i've found until now:
1 - Other samples work fine
2 - Running picard ValidateSamFile in realigned.bam (after IndelRealigner) gives many erros :
2a - Mate negative strand flag does not match read negative strand flag of mate
2b - Mate alignment does not match alignment start of mate
3c - Value was put into PairInfoMap more than once. (fatal)
3 - Running BaseRecalibrator with option -L 1:428-249250621 works fine!
After the fact that -L works fine i discarded the problem in vcf files and reference file. I don't know how to go further in this investigation since GATK 1 realined.bam also gives me the errors in (2) and those error are peanuts comparing the total number of reads.
The big difference here is that we're are using bwa7.
Any ideas? Thanks!
(i'm filtering out "secondary hits" given by bwa7 and will update this thread, if it works it may be helpful in the future)
GATK output:
INFO 14:11:47,441 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:11:47,443 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36
INFO 14:11:47,443 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:11:47,443 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:11:47,447 HelpFormatter - Program Args: -nct 8 -T BaseRecalibrator -I /mnt/work/rlb/pac661825//OUT_661825.realigned.bam -R ../data/databases//1KGP/GRCh37_female_exome_mt1kg.fasta --knownSites ../data/databases//dbSNP/dbSNP_137/00-All.vcf -o /mnt/work/rlb/pac661825//OUT_661825.grp
INFO 14:11:47,447 HelpFormatter - Date/Time: 2013/03/26 14:11:47
INFO 14:11:47,447 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:11:47,447 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:11:47,458 ArgumentTypeDescriptor - Dynamically determined type of ../data/databases/dbSNP/dbSNP_137/00-All.vcf to be VCF
INFO 14:11:47,500 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:11:47,558 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 14:11:47,565 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:11:47,577 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 14:11:47,587 RMDTrackBuilder - Loading Tribble index from disk for file ../data/databases/dbSNP/dbSNP_137/00-All.vcf
INFO 14:11:47,704 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 8 processors available on this machine
INFO 14:11:47,745 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 14:11:47,750 GenomeAnalysisEngine - Done creating shard strategy
INFO 14:11:47,750 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 14:11:47,750 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 14:11:47,773 BaseRecalibrator - The covariates being used here:
INFO 14:11:47,773 BaseRecalibrator - ReadGroupCovariate
INFO 14:11:47,773 BaseRecalibrator - QualityScoreCovariate
INFO 14:11:47,773 BaseRecalibrator - ContextCovariate
INFO 14:11:47,774 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 14:11:47,774 BaseRecalibrator - CycleCovariate
INFO 14:11:47,776 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 14:11:47,777 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO 14:12:18,626 ProgressMeter - 1:15956928 1.10e+06 30.0 s 28.0 s 0.5% 95.1 m 94.6 m
INFO 14:12:48,655 ProgressMeter - 1:34102053 2.70e+06 60.0 s 22.0 s 1.1% 89.0 m 88.0 m
INFO 14:13:18,685 ProgressMeter - 1:59096606 4.50e+06 90.0 s 20.0 s 1.9% 77.1 m 75.6 m
INFO 14:13:48,714 ProgressMeter - 1:103467532 5.90e+06 120.0 s 20.0 s 3.4% 58.7 m 56.7 m
INFO 14:14:18,745 ProgressMeter - 1:153234111 7.50e+06 2.5 m 20.0 s 5.0% 49.5 m 47.0 m
INFO 14:14:48,774 ProgressMeter - 1:172414433 9.30e+06 3.0 m 19.0 s 5.7% 53.1 m 50.1 m
INFO 14:15:19,054 ProgressMeter - 1:208266349 1.10e+07 3.5 m 19.0 s 6.9% 51.3 m 47.8 m
INFO 14:15:49,095 ProgressMeter - 1:247611815 1.27e+07 4.0 m 19.0 s 8.2% 49.3 m 45.2 m
INFO 14:15:56,507 GATKRunReport - Uploaded run statistics report to AWS S3
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (0) > (-1) STOP -- this should never happen -- call Mauricio! at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:537) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesLeftTail(ReadClipper.java:176) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:389) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:392) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:244) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:131) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:230) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:218) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679)
What are the BD and BI flags that get added to my bam files after base recalibration? They seem to consist of a long string of "N"s, and I'm trying to understand if that is correct.
Thanks!
Here's what I'm running:
INFO 12:18:33,096 HelpFormatter - Program Args: -T BaseRecalibrator -I /home/sheenams/gatk_test/gatk2.2/H103.GATKinitialrmdup.srt.bam -
R /home/genetics/Genomes/gatk-bundle/human_g1k_v37.fasta -knownSites /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf -cov ReadGroup
Covariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o /home/sheenams/gatk_test/gatk2.2/H103.recal_data.csv -
log /home/sheenams/gatk_test/gatk2.2/H103.gatk_log
Here's the error I'm getting
INFO 12:18:33,309 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 12:18:33,310 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 12:18:33,353 BaseRecalibrator - The covariates being used here:
INFO 12:18:33,353 BaseRecalibrator - ReadGroupCovariate
INFO 12:18:33,354 BaseRecalibrator - QualityScoreCovariate
INFO 12:18:33,354 BaseRecalibrator - ContextCovariate
INFO 12:18:33,354 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 12:18:33,354 BaseRecalibrator - CycleCovariate
INFO 12:18:33,355 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO 12:18:33,355 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:18:33,355 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO 12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO 12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 2002, 3]
INFO 12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 12:18:36,198 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read?
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:371)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:287)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:252)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Array length mismatch detected. Malformed read?
##### ERROR ------------------------------------------------------------------------------------------
I used Picards' ValidateSam script on my bam file, but it says its fine. How do I fix this error?
Thanks
I've just run the BaseRecalibrator on some whole genome sequences, and while scanning through the recalibration file, I noticed that some of the bases at the beginning and ends of reads were getting very high recalibration values:
SxaQSEQsXAP010_lane_1 6 -99 Cycle M 7.5248 416048 73563
SxaQSEQsXAP010_lane_1 6 99 Cycle M 6.7402 271864 57587
SxaQSEQsXAP010_lane_1 6 -100 Cycle M 30.1585 519622 500
SxaQSEQsXAP010_lane_1 6 100 Cycle M 30.7455 408415 343
SxaQSEQsXAP010_lane_1 7 1 Cycle M 37.0476 55736 10
SxaQSEQsXAP010_lane_1 7 2 Cycle M 9.6561 55347 5990
...
SxaQSEQsXAP010_lane_1 7 -99 Cycle M 9.3230 14040721 1640938
SxaQSEQsXAP010_lane_1 7 99 Cycle M 9.0272 10199039 1275971
SxaQSEQsXAP010_lane_1 7 -100 Cycle M 33.1557 23210317 11222
SxaQSEQsXAP010_lane_1 7 100 Cycle M 33.9099 21072616 8564
SxaQSEQsXAP010_lane_1 8 -6 Cycle M 7.2585 42164 7926
...
SxaQSEQsXAP010_lane_1 21 -98 Cycle M 22.7383 839160 4466
SxaQSEQsXAP010_lane_1 21 98 Cycle M 22.5192 716787 4012
SxaQSEQsXAP010_lane_1 21 -99 Cycle M 39.9141 872572 88
SxaQSEQsXAP010_lane_1 21 99 Cycle M 40.9464 696355 55
SxaQSEQsXAP010_lane_1 21 -100 Cycle M 38.9586 999226 126
SxaQSEQsXAP010_lane_1 21 100 Cycle M 39.2492 799184 94
SxaQSEQsXAP010_lane_1 22 -1 Cycle M 37.2879 69618 12
SxaQSEQsXAP010_lane_1 22 1 Cycle M 36.5709 108966 23
SxaQSEQsXAP010_lane_1 22 -2 Cycle M 37.7221 35509 5
SxaQSEQsXAP010_lane_1 22 2 Cycle M 37.9585 99992 15
SxaQSEQsXAP010_lane_1 22 -3 Cycle M 21.2202 62377 470
SxaQSEQsXAP010_lane_1 22 3 Cycle M 23.3286 118578 550
A possible explanation is that the aligner (novoalign) is clipping any bases which mismatch, and so there are very few mismatches at the ends and beginnings of reads. That would mean that there are actually very few errors at the beginning and ends of reads, and empirically, the measured quality is high.
However, even if this is correct, I'm wondering if I should trust the recalibration: A base which was originally marked with a quality of 6 or 7 suddenly has the possibility of getting a big boost (modulo any other covariates).
Do you have any thoughts, suggestions, or other possible explanations?
Thanks,
Kevin
Hello,
I finally managed to install gsalib and so I checked the plots coming out of the BaseRecalibrator runs I was running. Then I noticed that the plots had only the original data in it, and not the recalibrated data.
More details on my setup:
Ran BaseRecalibrator as follows
java -Xmx32G -jar /path/to/GenomeAnalysisTK.jar -R /path/to/ucsc.hg19.fasta \
-L /path/to/my/regions.bed -I joined_file.bam -knownSites \
/path/to/dbsnp.vcf -o joined_file.grp -nct 24 --plot_pdf_file result.pdf
Results are similar to the attached file.
Any idea on what I could be doing wrong?
What is the criterion (or criteria) for applying the Yates correction to the empirical base qualities in the base quality recalibration?
Thanks!
Does GATK BaseRecalibrator work with Bam files produces with the SOLID Lifescope mapper?
You show in the a base quality recalibration presentation that recalibration also should work on SOLID data. But you don't mention if it also works for Bam files produced with lifescope. BWA mapping quality is from 0-37 , Lifescope mapping quality is from 0 - 95.
I get an ArrayIndexOutOfBoundsException on the lifescope Bam files.
`##### ERROR ------------------------------------------------------------------------------------------
java.lang.ArrayIndexOutOfBoundsException: -92 at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158) at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:225) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530) at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
I am using the latest version of GATK, During the Quality score recalibration I found the following error. The code was as follows: java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R ~/SCZ_data/ref_hg19/hg19sum_upper.fa --DBSNP dbsnp132.txt -I ../output.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile input.recal_data.csv
later i understood that i should use BaseRecalibrator for this new version of GATK, but i am still not sure what to put in the reference file for SNPs with the -knownSites command from where to obtain these vcf files?
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -I my_reads.bam \ -R resources/Homo_sapiens_assembly18.fasta \ -knownSites bundle/hg18/dbsnp_132.hg18.vcf \ -knownSites another/optional/setOfSitesToMask.vcf \ -o recal_data.grp
Can you please suggest me what should be done??
I am using GATK version 2.4.9. When trying to execute BaseRecalibrator I get this message:
I'm not specifying any of the advanced options, just an input BAM, the reference genome and a BED file for the -knownSites argument. The BAM file has been correctly validated with Picard and I've checked the BED and all stop positions are greater or equal to the start positions.
I also tried with an older version of GATK so that I could use CountCovariates, but it gave a similar error (same error with different start and stop positions).
The exact call to the method is:
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I XXXX.marked.realigned.fixed.bam -R hg18KO.fa -knownSites dbsnp130_hg18_KO.bed -o XXXX_recal_data.grp
Any help would be greatly appreciated.
Mikel
Aloha,
I am calling SNPs on an organism without a reference genome or database of known polymorphisms, so I'm trying to follow the advice posted here (and in the BaseRecalibrator documentation).
I've successfully called SNPs on the un-recalibrated .bam file, then used those SNPs to recalibrate, then called SNPs on the recalibrated .bam file. As expected, I got significantly fewer (and presumably more accurate) results.
I then used the new, reduced set of SNPs to recalibrate again. When I attempted to call SNPs on this "Round Two" recalibrated .bam file, I got the following error:
SAM/BAM file recalibrated.2.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!
I attempted to use PicardTools ValidateSamFile and CleanSam but received the same message (as an IllegalArgumentException). I would definitely consider myself a novice in the field. Any advice you can give will be greatly appreciated.
Hello,
I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.
Any help or guidance would be greatly appreciated!
Command Used for HaplotypeCaller:
java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG
Commands Used for pre-processing (run in sequence using a Perl script):
java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf
java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam
samtools index exome.realigned.dedup
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate
java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam
Dear GATK Team,
I have recently downloaded the GATK Bundle to get the human reference genome and its associated annotations.
After the mapping step on my lane BAM files, I am planning on using IndelRealigner and BaseRecalibrator as it is explained in the "Best Practices v4".
I am always confused about which annotation file I should use for my analysis.
For the Indel realignment, in the command line arguments of RealignerTargetCreator, one have to set the '--known' switch to indicate known indel sites.
--known:indels,vcf Mills_and_1000G_gold_standard.indels.b37.sites.vcf --known:dbsnp,vcf dbsnp_135.b37.vcf
But in the annotations folder, you can also find 'dbsnp_135.b37.excluding_sites_after_129.vcf' for dbsnp (version before 1000K genomes). Depending on which one I use the target intervals files are pretty different. So I am really wondering which one should be used in my case ? Or is there any other factor that could drive me to the better choice ?
I have a similar dilemna with base recalibration, "dbsnp_135.b37.vcf" or "dbsnp_135.b37.excluding_sites_after_129.vcf" in the '-knownSites' switch ?
Thanks a lot, Best,
Anthony
Hi, I am trying to run the base recalibrator. For some of my sequences, it work perfectly, but for others, it's always crashing and giving the following error message :
INFO 16:49:21,718 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:49:21,720 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-6-gebbba25, Compiled 2013/01/08 19:29:18
INFO 16:49:21,720 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:49:21,720 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:49:21,724 HelpFormatter - Program Args: -T BaseRecalibrator -I alnSortedNoDupRGRealigned1.bam -l INFO -R 1.198.A.contig22.fa -knownSites alnSortedNoDupRGRealigned1_t0.01.vcf -o test.rcl -fixMisencodedQuals --filter_mismatching_base_and_quals
INFO 16:49:21,725 HelpFormatter - Date/Time: 2013/02/11 16:49:21
INFO 16:49:21,725 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:49:21,725 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:49:21,737 ArgumentTypeDescriptor - Dynamically determined type of alnSortedNoDupRGRealigned1_t0.01.vcf to be VCF
INFO 16:49:21,744 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:49:21,970 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 16:49:21,977 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:49:21,992 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:49:22,005 RMDTrackBuilder - Loading Tribble index from disk for file alnSortedNoDupRGRealigned1_t0.01.vcf
INFO 16:49:22,040 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:49:22,040 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 16:49:22,134 BaseRecalibrator - The covariates being used here:
INFO 16:49:22,134 BaseRecalibrator - ReadGroupCovariate
INFO 16:49:22,134 BaseRecalibrator - QualityScoreCovariate
INFO 16:49:22,134 BaseRecalibrator - ContextCovariate
INFO 16:49:22,135 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 16:49:22,135 BaseRecalibrator - CycleCovariate
INFO 16:49:22,137 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO 16:49:22,138 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,139 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,139 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO 16:49:22,139 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO 16:49:22,140 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1002, 3]
INFO 16:49:22,140 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,141 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,145 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 16:49:22,147 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO 16:49:24,487 GATKRunReport - Uploaded run statistics report to AWS S3
java.lang.ArrayIndexOutOfBoundsException: -31 at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158) at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:225) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530) at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Here is the command line I am using: java -Xmx4g -jar /mit/sjlabrie/software/GenomeAnalysisTK-2.3-6-gebbba25/GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -I alnSortedNoDupRGRealigned1.bam \ -l INFO \ -R 1.198.A.contig22.fa \ -knownSites alnSortedNoDupRGRealigned1_t0.01.vcf \ -o test.rcl \ -fixMisencodedQuals \ --filter_mismatching_base_and_quals \
Any Idea what's going on ?
Thank you,
Simon
I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator walker. According to this post the BaseRecalibrator should output "A PDF file containing quality control plots showing the patterns of recalibration of the data", however I do not have any such file. Both the BaseRecalibrator and PrintReads steps of the BQSR pipeline appear to have worked as I have a recalibrated BAM file and the accompanying GATKReport but I would like to be able to view plots of the recalibration process (and preferably have these generated automatically by the recalibration pipeline).
Can you please help? Thanks
I cannot produce BQSR plots, although I can open the grp file with gsa.read.gatkreport.
Here's the command:
java -Xmx1g -jar $shares/GenomeAnalysisTK-2.3-6-gebbba25/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I ./0.reorder.bam \
-R $shares/ftp.broadinstitute.org/bundle/2.3/hg19/ucsc.hg19.fasta \
-knownSites $shares/ftp.broadinstitute.org/bundle/2.3/hg19/dbsnp_137.hg19.vcf \
-BQSR ./0.reorder.bam.recal.grp \
-o ./0.reorder.bam.post_recal.grp \
--plot_pdf_file ./0.reorder.bam.post_recal.grp.pdf \
-L chr1:1-1000 \
-l DEBUG \
--intermediate_csv_file ./0.reorder.bam.post_recal.grp.csv
##### ERROR stack trace
java.lang.NullPointerException
at org.broadinstitute.sting.utils.Utils.join(Utils.java:286)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.writeCSV(RecalUtils.java:450)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.generateRecalibrationPlot(RecalUtils.java:394)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.generatePlots(BaseRecalibrator.java:474)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:464)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:97)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
It looks like the csv file is not being produced.
Thanks!
I'm attempting to use the BaseRecalibrator tool for 30-50x depth whole genome datasets with BAM files of around 100 - 150GB. However it is very computationally demanding so I'd really like to distribute the processing over many cores on our cluster. I've done this for the indel realignment process by running for each chromosome separately as described in the now retired guidelines on "Parallelism with the GATK" (I think a new version is due to be issued at some point). It's less clear, to me at least, how to do this for the BaseRecalibrator.
For example, is it possible to combine GATKReports for the recalibration data generated for separate chromosomes? Or should I run the on-the-fly recalibration with PrintReads and the -BQSR option using the recalibration data for each chromosome separately? If the latter, does it matter that for some of the smaller unplaced/unlocalized chromosomes the recalibration tables will contain values for covariates generated with only a few observations? The documentation on the Base Quality Score Recalibrator seems to suggest that the recalibration tables need to be calculated over the whole genome.
Thanks, Matt
Hello dear GATK People,
I'm failing with BaseRecalibrator from the new GATK version - my pipeline worked with the 2.1-11, below is my error message. Any quick fix or should I stick to the old version?
Ania
java.lang.IllegalArgumentException: fromIndex(402) > toIndex(101) at java.util.Arrays.rangeCheck(Unknown Source) at java.util.Arrays.fill(Unknown Source) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateKnownSites(BaseRecalibrator.java:280) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateSkipArray(BaseRecalibrator.java:259) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:239) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:287) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:252) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
.....
Hi there, I was trying to debug an error in the RScript generated after base recalibration, while running the DataProcessingPipeline.scala (run as it is). I get the following debug output
[...]
Error in file(filename, "r", blocking = TRUE) :
cannot open the connection
Calls: source ... eval.with.vis -> eval.with.vis -> gsa.read.gatkreport -> file
In addition: Warning messages:
1: In file(filename, "r", blocking = TRUE) :
cannot open file '/SAN/scratch3/sample378_TTAGGC_L004_R1_001.fastq.pre_recal.table.recal': No such file or directory
Execution halted
no file ending with "recal.table.recal" exists, but the file "recal.table" does exist. I couldn't find any step in the scala script where a ".recal" is added to "recal.table", nor a specific trait or class referring to the RScript itself, as I understand it's part of the walker BaseRecalibrator.
is this a small bug in the name handling, or am I doing something wrong somewhere?
thanks, Francesco
Hi,
I am currently working with a project where we have sequenced a library of approximately 70 bps insert sizes using 2x100 paired-end seq. While this can seem unnecessary, it can improve base qualities a lot.
I have used SeqPrep (https://github.com/jstjohn/SeqPrep) which strips adaptors and merges reads that overlap, in our case the entire read most of the times. This also boosts the base qualities, if a base was sequenced twice, the quality improves quite a bit. This way, base qualities can stretch up to 70 and over (probability of error 0.0001 x 0.0001 if both reads had Q40 at that base, it merged qual = 80). No funny business there. :)
However, this does not seem to play nicely with GATK. The realignment crashes (see below) saying the the base quals must be erroneous. In my case, but they are correct. Can I force GATK to work with these BQs? (--validation_strictness LENIENT didn't help as you can see below :)
cheers Daniel Klevebring
INFO 13:04:07,408 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:07,411 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28
INFO 13:04:07,411 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:04:07,411 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:04:07,416 HelpFormatter - Program Args: -T RealignerTargetCreator -I /scratch/3041404/P394_102.prmdup.bam -R /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/human_g1k_v37_clean.fasta -o /scratch/3041404/P394_102.realn.intervals --intervals /bubo/proj/b2010040/private/GoldenPath/NG_design/1000G_REF_picard_custom_design_target_regions_HG19.bed.interval_list --validation_strictness LENIENT
INFO 13:04:07,416 HelpFormatter - Date/Time: 2013/03/13 13:04:07
INFO 13:04:07,416 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:07,416 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:08,461 GenomeAnalysisEngine - Strictness is LENIENT
INFO 13:04:08,632 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 13:04:08,640 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:04:08,655 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 13:04:09,782 IntervalUtils - Processing 39772003 bp from intervals
INFO 13:04:10,001 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 13:04:10,262 GenomeAnalysisEngine - Done creating shard strategy
INFO 13:04:10,262 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:04:10,263 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 13:04:18,482 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/3041404/P394_102.prmdup.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 70; please see the GATK --help documentation for options related to this error
##### ERROR ------------------------------------------------------------------------------------------
HI When I run Base recabrator with the following command:
java -Xmx4g -jar /usr/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -I realignedBam.bam -R /data1/human_g1k_v37.fasta --knownSites /data1/snp132.vcf -o recalibration_report.grp
I get the following error :
INFO 07:15:53,380 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO 07:15:53,380 HttpMethodDirector - Retrying request
INFO 07:15:53,386 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO 07:15:53,387 HttpMethodDirector - Retrying request
INFO 07:15:53,393 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO 07:15:53,393 HttpMethodDirector - Retrying request
INFO 07:15:53,398 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO 07:15:53,398 HttpMethodDirector - Retrying request
INFO 07:15:53,405 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO 07:15:53,405 HttpMethodDirector - Retrying request
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-34-g07bda93):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
##### ERROR Name FeatureType Documentation
##### ERROR BCF2 VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_bcf2_BCF2Codec.html
##### ERROR BEAGLE BeagleFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_beagle_BeagleCodec.html
##### ERROR BED BEDFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broad_tribble_bed_BEDCodec.html
##### ERROR BEDTABLE TableFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_table_BedTableCodec.html
##### ERROR EXAMPLEBINARY Feature http://www.broadinstitute.org/gatk/gatkdocs/org_broad_tribble_example_ExampleBinaryCodec.html
##### ERROR GELITEXT GeliTextFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broad_tribble_gelitext_GeliTextCodec.html
##### ERROR OLDDBSNP OldDbSNPFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broad_tribble_dbsnp_OldDbSNPCodec.html
##### ERROR RAWHAPMAP RawHapMapFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_hapmap_RawHapMapCodec.html
##### ERROR REFSEQ RefSeqFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_refseq_RefSeqCodec.html
##### ERROR SAMPILEUP SAMPileupFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_sampileup_SAMPileupCodec.html
##### ERROR SAMREAD SAMReadFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_samread_SAMReadCodec.html
##### ERROR TABLE TableFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_table_TableCodec.html
##### ERROR VCF VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_vcf_VCFCodec.html
##### ERROR VCF3 VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_vcf_VCF3Codec.html
##### ERROR ------------------------------------------------------------------------------------------
Hi,
I got this error today running BaseRecalibrator:
java.lang.NullPointerException at java.util.concurrent.locks.AbstractQueuedSynchronizer.hasQueuedPredecessors(AbstractQueuedSynchronizer.java:1453) at java.util.concurrent.locks.ReentrantLock$FairSync.tryAcquire(ReentrantLock.java:240) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1158) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.PriorityBlockingQueue.take(PriorityBlockingQueue.java:244) at org.broadinstitute.sting.utils.nanoScheduler.Reducer.reduceAsMuchAsPossible(Reducer.java:121) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:510) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636)
The command arguments I used are: -nct 4 -T BaseRecalibrator --intermediate_csv_file inter.csv -I realigned.bam -R Homo_sapiens.GRCh37.68.dna.chromosome.all.fasta -o recal_data.grp --plot_pdf_file recal.pdf -knownSites dbsnp_137.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -knownSites 1000G_phase1.indels.b37.vcf --disable_indel_quals
This command has previously worked with other data using the same version of GATK.
Hi,
I am working on bovine exome sequencing datasets. I have 22 animals ( the read length of 11 samples is 90 bp, others are 100 bp) . I merged all of them into a big bam file, and did Indel realignment on it. When I run BaseRecalibrator, my job was aborted on chr12. And I check the region chr12(28982297, 29984297) of my reference file with Samtools, it seems not damaged. Any suggestion?
Wanbo
net.sf.picard.PicardException: Unable to load chr12(28982297, 29984297) from /data/Wanbo/genomes/bosTau6.fasta
at net.sf.picard.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:208)
at org.broadinstitute.sting.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:173)
at org.broadinstitute.sting.gatk.datasources.providers.LocusReferenceView.initializeReferenceSequence(LocusReferenceView.java:153)
at org.broadinstitute.sting.gatk.datasources.providers.LocusReferenceView.
hello I am a new user of GATK, until now I found the answer to my questions on your forum. Following is the command "java -jar ../GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -I 9485_realignedBam.fixed.bam -R hg19.fa -knownSites dbsnp_132_hg19.vcf -o recal_data.grp -filterMBQ" from wich I get this error message :
# # # # # ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -3
at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon (BAQ.java: 158)
at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal (BAQ.java: 246)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM (BAQ.java: 542)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM (BAQ.java: 595)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM (BAQ.java: 530)
at org.broadinstitute.sting.utils.baq.BAQ.baqRead (BAQ.java: 663)
at
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map (BaseRecalibrator.java: 243)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map (BaseRecalibrator.java: 112)
at
at
at
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute (NanoScheduler.java: 219)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse (TraverseReadsNano.java: 91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse (TraverseReadsNano.java: 55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute (LinearMicroScheduler.java: 83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute (GenomeAnalysisEngine.java: 281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute (CommandLineExecutable.java: 113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start (CommandLineProgram.java: 237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start (CommandLineProgram.java: 147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main (CommandLineGATK.java: 91)
# # # # # ERROR -------------------------------------------- ----------------------------------------------
# # # # # ERROR A RUNTIME ERROR GATK has occurred (version 2.3-6-gebbba25)
# # # # # ERROR
# # # # # ERROR Please visit the wiki to see if this is a known problem
# # # # # ERROR If not, please post the error, with stack trace to the forum GATK
# # # # # ERROR Visit our website and forum for extensive documentation and answers to
# # # # # ERROR Commonly asked questions http://www.broadinstitute.org/gatk
# # # # # ERROR
# # # # # ERROR MESSAGE: -3
# # # # # ERROR --------------------
I can not find it on a forum, can you explain it and help me
thank you
following on from Pallavi's question "I Don't understand what is filtering out during traversal ", can you explain why the number of processed reads ("INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads" from Pallavi's example) differs so much from the total number of reads ("INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total"). Do these numbers relate to different sets of reads?