# Tagged with #baserecalibrator 2 documentation articles | 3 announcements | 94 forum discussions

Created 2012-07-23 23:55:01 | Updated 2012-07-23 23:55:01 | Tags: baserecalibrator gatkdocs

A new tool has been released!

Check out the documentation at BaseRecalibrator.

Created 2012-07-23 17:12:33 | Updated 2015-07-07 01:53:27 | Tags: bqsr baserecalibrator printreads baserecalibration

Detailed information about command line options for BaseRecalibrator can be found here.

## Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular, these new probabilities fit very naturally as the gap penalties in HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, before recalibration a file could contain only bases with a reported quality of Q25, which seems good. However, these bases may actually mismatch the reference at a rate of 1 in 100, which makes them Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the end of the reads than at the beginning. Mismatches are also strongly associated with sequence context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.
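The Q25 versus Q20 example relies on the standard Phred relationship Q = -10 * log10(p), where p is the error probability. The following minimal sketch shows the conversion in both directions; the function names are illustrative and are not part of the GATK.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(p)

reported_q = 25
observed_mismatch_rate = 1 / 100  # the "1 in 100" rate from the text

# A reported Q25 claims roughly 3 errors per 1000 bases...
print(f"Q{reported_q} implies p = {phred_to_error_prob(reported_q):.5f}")
# ...whereas an observed 1-in-100 mismatch rate corresponds to Q20.
print(f"rate {observed_mismatch_rate} is Q{error_prob_to_phred(observed_mismatch_rate):.0f}")
```

The gap between the two numbers is the "false confidence" that recalibration is meant to remove.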

The system was designed so that users can easily add new covariates to the calculations. Users wishing to add their own covariate can look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to define a getValue method which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

## Running the tools

### BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

• assigned quality score
• machine cycle producing this base
• current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
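The counting step described above can be sketched as follows. This is a simplified illustration and not the BaseRecalibrator implementation: it assumes ungapped, forward-strand reads given as hypothetical `(start, bases, quals)` tuples, a reference held in a plain string, and known variant sites given as a set of 0-based positions.

```python
from collections import defaultdict

def tabulate(reads, reference, known_sites):
    """Count observations and mismatches per (quality, cycle, dinucleotide) bin.

    reads:       iterable of (start, bases, quals) tuples (assumed ungapped)
    reference:   reference sequence as a string
    known_sites: set of 0-based reference positions to exclude
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for start, bases, quals in reads:
        for cycle, (base, qual) in enumerate(zip(bases, quals), start=1):
            pos = start + cycle - 1
            if pos in known_sites:
                continue  # known polymorphism: not evidence of machine error
            prev = bases[cycle - 2] if cycle > 1 else "N"  # no previous base at cycle 1
            key = (qual, cycle, prev + base)
            counts[key][0] += 1
            if base != reference[pos]:
                counts[key][1] += 1
    return dict(counts)

reference = "ACGTACGT"
reads = [
    (0, "ACGA", [30, 30, 20, 20]),  # last base mismatches the reference
    (4, "ACTT", [30, 30, 20, 20]),  # third base mismatches, but at a known site
]
table = tabulate(reads, reference, known_sites={6})
for key in sorted(table):
    print(key, table[key])
```

Only the mismatch in the first read is counted as an error; the mismatch at position 6 is skipped because that position is listed as a known site.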

### Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:


java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fasta \
-I input.bam \
-BQSR recalibration_report.grp \
-o output.bam


After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.

This step uses the recalibration table data in recalibration_report.grp, produced by BaseRecalibrator, to recalibrate the quality scores in input.bam, and writes out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

• the sum of the global difference between reported quality scores and the empirical quality
• plus the quality bin specific shift
• plus the cycle x qual and dinucleotide x qual effect
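The additive structure of these terms can be sketched as below. This is one reading of the list, assuming the shifts are applied to the reported quality score; the shift tables and all numbers are invented for illustration and do not come from a real recalibration report.

```python
def recalibrated_quality(reported_q, cycle, dinuc, global_shift,
                         qual_shift, cycle_shift, dinuc_shift):
    """Apply the additive corrections to a reported quality score.

    global_shift: difference between empirical and reported quality overall
    qual_shift:   {reported_q: shift} for the quality bin
    cycle_shift:  {(reported_q, cycle): shift} for the cycle x qual effect
    dinuc_shift:  {(reported_q, dinuc): shift} for the dinucleotide x qual effect
    Bins with no entry contribute no shift.
    """
    q = reported_q + global_shift
    q += qual_shift.get(reported_q, 0.0)
    q += cycle_shift.get((reported_q, cycle), 0.0)
    q += dinuc_shift.get((reported_q, dinuc), 0.0)
    return q

# Illustrative (made-up) shift tables
global_shift = -5.0
qual_shift = {25: 0.5}
cycle_shift = {(25, 1): 1.0, (25, 50): -2.0}
dinuc_shift = {(25, "TG"): 1.5, (25, "AC"): -1.5}

late_ac = recalibrated_quality(25, 50, "AC", global_shift,
                               qual_shift, cycle_shift, dinuc_shift)
early_tg = recalibrated_quality(25, 1, "TG", global_shift,
                                qual_shift, cycle_shift, dinuc_shift)
print(late_ac, early_tg)
```

Two bases that both started at Q25 end up with different scores once cycle and sequence context are taken into account.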

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

### Miscellaneous information

• The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
• A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
• Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
• The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems: the correction has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.
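The effect of the correction in the last point can be illustrated with a short sketch that applies the formula exactly as stated above. The function name is hypothetical, and the code may differ in detail from what the GATK does internally. The large-bin counts are taken from the SRR032768 deletion row of the example RecalTable0 in this document.

```python
import math

def empirical_quality(mismatches, bases, corrected=True):
    """Phred-scaled empirical quality for one bin.

    With corrected=True, uses (mismatches + 1) / (bases + 2) as stated in the text.
    With corrected=False, uses the raw rate, which is undefined for zero mismatches.
    """
    if corrected:
        rate = (mismatches + 1) / (bases + 2)
    else:
        rate = mismatches / bases
    return -10 * math.log10(rate)

# Sparse bin: 0 mismatches in 10 bases. The raw rate would imply infinite quality;
# the corrected rate gives a cautious estimate instead.
sparse_q = empirical_quality(0, 10)
print(f"sparse bin: Q{sparse_q:.2f}")

# Large bin: the correction barely changes the result.
large_corrected = empirical_quality(222475, 2642683174)
large_raw = empirical_quality(222475, 2642683174, corrected=False)
print(f"large bin: Q{large_corrected:.4f} (corrected), Q{large_raw:.4f} (raw)")
```

For the sparse bin the estimate stays near Q11 rather than becoming unbounded, while for the bin with billions of observations the corrected and raw values are practically identical.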

## Example pre and post recalibration results

• Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
• There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

## The output of the BaseRecalibrator

• A Recalibration report containing all the recalibration information for the data

Note that the BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

### The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

• Arguments Table -- a table with all the arguments and their values
• Quantization Table
• ReadGroup Table
• Quality Score Table
• Covariates Table

#### Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:


#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...


#### Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this with the engine argument -qq. With -qq 0 qualities are not quantized; with -qq N the quantization bins are recalculated on the fly using N bins. Note that quantization is completely experimental for now, and we do not recommend using it unless you are a super advanced user.

Example Quantization table:


#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...


#### ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions and deletions. This is not different from the table used in the old table recalibration walker.


#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762


#### Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions and deletions. This is not different from the table used in the old table recalibration walker.


#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...


#### Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.


#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...
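The report tables shown in this section are plain whitespace-aligned text, so they can be read with a few lines of code. The sketch below is a minimal reader written for the excerpts in this document only. It assumes that no field contains whitespace and that `#:GATKTable` lines are headers to skip; it is not an official GATKReport parser, and the function name is hypothetical.

```python
def parse_gatk_table(text):
    """Parse one whitespace-aligned report table into a list of row dicts.

    Lines starting with '#:GATKTable' are skipped. The first remaining line is
    treated as the column header. Lines whose field count does not match the
    header (for example a '...' elision) are ignored.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    body = [ln for ln in lines if not ln.startswith("#:GATKTable")]
    header = body[0].split()
    rows = []
    for ln in body[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue
        rows.append(dict(zip(header, fields)))
    return rows

# Excerpt of the RecalTable0 example shown earlier in this document
sample = """
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
...
"""

rows = parse_gatk_table(sample)
for row in rows:
    rate = int(row["Errors"]) / int(row["Observations"])
    print(row["ReadGroup"], row["EventType"], row["EmpiricalQuality"], f"{rate:.2e}")
```

All values are kept as strings so that the caller decides how to convert each column.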


## Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, but applying the table to any of the chromosome BAM files always fails because it hits my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately, without this information the data become almost completely unusable, since the quality of the bases will be inferred to be much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

• First do an initial round of SNP calling on your original, unrecalibrated data.
• Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
• Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
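The second step, keeping only the highest-confidence calls, could be done in many ways. The sketch below shows one simple option: filtering VCF records on the QUAL column. The threshold of 50 and the function name are assumptions for illustration, not a GATK recommendation, and a real pipeline may prefer stricter or annotation-based filters.

```python
def select_high_confidence(vcf_lines, min_qual=50.0):
    """Keep header lines and records whose QUAL is at least min_qual.

    vcf_lines: iterable of VCF lines (tab-separated, QUAL in the 6th column)
    Records with a missing QUAL ('.') are dropped.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # always keep meta-information and column header
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept

# Tiny made-up call set from an initial round of SNP calling
calls = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t812.5\t.\t.",
    "chr1\t250\t.\tC\tT\t12.3\t.\t.",
    "chr1\t400\t.\tG\tA\t.\t.\t.",
]
bootstrap_known_sites = select_high_confidence(calls)
print("\n".join(bootstrap_known_sites))
```

The filtered lines can then be written to a VCF file and passed to the recalibrator as the database of known sites.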

### Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

## Base Quality Score Recalibration

• Improved the algorithm around homopolymer runs to use a "delocalized context".
• Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
• Fixed bug where the tool failed for reads that begin with insertions.
• Fixed bug in the scatter-gather functionality.
• Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

## Unified Genotyper

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
• Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
• Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
• Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
• Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
• The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
• Fixed annotations (particularly AD) for indel calls; previous versions didn't bin reads into the reference or alternate sets correctly.
• Generalized ploidy model now handles reference calls correctly.

## Haplotype Caller

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Now requires at least 10 samples to merge variants into complex events.

## Variant Annotator

• Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

## Reduce Reads

• Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
• Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A,C,G, or T bases.
• Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

## Variant Filtration

• Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

## Variant Eval

• AlleleCount stratification now supports records with ploidy other than 2.

## Combine Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Now outputs the first non-missing QUAL, not the maximum.

## Select Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Removed the -number argument because it gave biased results.

## Validate Variants

• Added option to selectively choose particular strict validation options.
• Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
• Improved the error message around unused ALT alleles.

## Somatic Indel Detector

• Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

## Miscellaneous

• Fixed raw HapMap file conversion bug in VariantsToVCF.
• Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
• Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
• Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
• Fixed bug in BCF2 writer for case where all genotypes are missing.
• Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
• Fixed bug in Phase By Transmission when there are no likelihoods present.
• Fixed bug in fasta .fai generation.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Created 2012-09-20 20:43:12 | Updated 2012-09-20 20:43:12 | Tags: bqsr baserecalibrator queue scatter-gather bug

### Consequences/ Solution:

Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.

This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!

Created 2012-08-20 18:52:48 | Updated 2012-08-23 14:11:29 | Tags: unifiedgenotyper official baserecalibrator combinevariants haplotypecaller selectvariants varianteval release-notes

## Base Quality Score Recalibration

• Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
• Implemented support for SOLiD no call strategies other than throwing an exception.
• Fixed smoothing in the BQSR bins.
• Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

## Unified Genotyper

• Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
• UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
• Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
• In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
• Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

## Haplotype Caller

• Added LowQual filter to the output when appropriate.
• Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
• Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
• Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
• Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
• Fixed bug where non-standard bases from the reference would cause errors.
• Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

## Reduce Reads

• Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
• Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
• Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

## Variant Eval

• Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
• Fixed incorrect allele counting in IndelSummary evaluation.

## Combine Variants

• Now outputs the first non-MISSING QUAL, instead of the maximum.
• Now supports multi-threaded running (with the -nt argument).

## Select Variants

• Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
• No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
• If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).

## Miscellaneous

• GATK now generates a proper error when a gzipped FASTA is passed in.
• Various improvements throughout the BCF2-related code.
• Removed various parallelism bottlenecks in the GATK.
• Added support of X and = CIGAR operators to the GATK.
• Catch NumberFormatExceptions when parsing the VCF POS field.
• Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
• Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
• We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
• Added support for handling complex events in ValidateVariants.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Created 2015-09-02 03:06:31 | Updated | Tags: baserecalibrator notprimaryalignmentfilter unmappedreadfilter

After running BaseRecalibrator we got:

INFO 12:01:19,574 TraversalEngine - 1493749 reads were filtered out during traversal out of 5418149 total (27.57%)
INFO 12:01:19,574 TraversalEngine - -> 824329 reads (15.21% of total) failing NotPrimaryAlignmentReadFilter
INFO 12:01:19,574 TraversalEngine - -> 558510 reads (10.31% of total) failing UnmappedReadFilter
INFO 12:01:19,574 TraversalEngine - -> 110910 reads (2.05% of total) failing ZeroMappingQualityReadFilter

I understand that reads with mapping quality zero are mapped but with zero confidence, so they are useless.

Does this imply that there is a problem with the alignment?

thanks

Created 2015-08-31 11:58:59 | Updated | Tags: baserecalibrator best-practices ploidy pooling

Dear GATK Team,

BaseRecalibrator is in the newest Best Practices recommendations. Does this tool support samples with ploidy higher than 2 (e.g. pooled samples), and would you recommend its use for them?

Thanks! Bernt

Created 2015-08-13 14:07:05 | Updated | Tags: bqsr baserecalibrator read-group-effects

Hi Team,

I have a pooled dataset with 95 individuals on one lane. This I have in 95 files, having each unique readgroups like this:

@RG ID:TGCCATG SM:TGCCATG PL:ILLUMINA LB:LB PU:LB_1
@RG ID:ACCTGAT SM:ACCTGAT PL:ILLUMINA LB:LB PU:LB_1
[...]

@RG ID:LB_1 SM:MIX PL:ILLUMINA LB:LB PU:LB_1
@RG ID:LB_1 SM:MIX PL:ILLUMINA LB:LB PU:LB_1
[...]

Then I ran BQSR on:

1. All original files together, by using --input_file multiple times
2. All files with modified RG

# -filterMBQ #--defaultBaseQualities 35 #--fix_misencoded_quality_scores #-fixMisencodedQuals #--allow_potentially_misencoded_quality_scores #-allowPotentiallyMisencodedQualsi

Created 2015-07-06 22:11:22 | Updated | Tags: baserecalibrator printreads quality-score

I just wanted to check if the PrintReads BAMs keep the original quality score.

Created 2015-04-24 08:31:06 | Updated | Tags: baserecalibrator plots

Hi,

I am using the latest version of GATK. But it is giving me the below error, though i have specified the argument

-o $DATPATH/recal.grp \
-plots $DATPATH/recal.grp.pdf

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR MESSAGE: Argument with name 'plots' isn't defined.
ERROR ------------------------------------------------------------------------------------------

Please advise on this.

Thanks

Created 2015-04-24 06:44:34 | Updated 2015-04-24 07:15:42 | Tags: baserecalibrator knownsites

Hi,

I would like to know about the -knownSites argument of BaseRecalibrator. How do I get the latest available list for this? How should it be used when we want to identify unknown variant sites? What steps should be used instead of the "Quality score recalibration-CountCovariates, TableRecalibration" steps that were available in older versions of GATK?

Thanks

Created 2015-03-06 21:16:10 | Updated | Tags: baserecalibrator

Base quality may be genuinely associated with some covariates. For instance, average base quality may decrease at the end of the run; or average base quality may be genuinely different between the read groups that come from different runs. Will base quality recalibrator preserve the true quality differences in such cases?

Created 2015-03-06 15:09:48 | Updated | Tags: baserecalibrator

Hello, I am trying to run baserecalibrator on a bam file after it has completed the Split'N'Trim + ReassignMappingQuality module as documented in the RNASeq Best Practices but am getting an error.

java -jar /home/swong/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar –T BaseRecalibrator –R /home/swong/reference/hs37d5.fa -I TNBC_0019_1_BE_Whole_T3_O2KON_J00264.proj.Aligned.out.PICARD.sorted.dedup.ADDRG.SPLIT.bam –knownSites /home/swong/gatk_bundle_2.5/b37/dbsnp_137.b37.vcf –knownSites /home/swong/gatk_bundle_2.5/b37/Mills_and_1000G_gold_standard.indels.b37.vcf -o J00264_recal_data.table

##### ERROR Invalid argument value '/home/swong/gatk_bundle_2.5/b37/Mills_and_1000G_gold_standard.indels.b37.vcf' at position 9.

Not sure what I am doing wrong; I am very new to this and am trying to learn, but I am following the tutorials exactly. Thank you very much!
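One thing worth checking in a command like the one above is whether every option starts with a plain ASCII hyphen. As posted, several options (`–T`, `–R`, `–knownSites`) begin with an en dash (U+2013), which word processors often substitute for `-`; whether that is the cause of this particular error is an assumption, not a confirmed diagnosis. The sketch below is a hypothetical helper, not part of GATK, that lists such tokens; the file names in the sample command are placeholders.

```python
import unicodedata


def find_non_ascii_dashes(command):
    """Return (position, token) for each token that starts with a
    Unicode dash character other than the ASCII hyphen '-'."""
    hits = []
    for position, token in enumerate(command.split(), start=1):
        first = token[0]
        # Unicode category "Pd" is dash punctuation; it includes ASCII '-',
        # so that one is excluded explicitly.
        if first != "-" and unicodedata.category(first) == "Pd":
            hits.append((position, token))
    return hits


# Placeholder command; \u2013 is the en dash.
command = (
    "java -jar GenomeAnalysisTK.jar \u2013T BaseRecalibrator \u2013R ref.fa "
    "-I input.bam \u2013knownSites known.vcf -o recal_data.table"
)

for position, token in find_non_ascii_dashes(command):
    print(position, repr(token))
```

Any token reported here can be retyped with a normal hyphen before rerunning the command.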

Created 2015-02-25 02:12:39 | Updated | Tags: vqsr baserecalibrator haplotypecaller knownsites resources variant-recalibration

Hi, I have a general question about the importance of known VCFs (for BQSR and HC) and resource files (for VQSR). I am working on rice, for which the only known sites are the dbSNP VCF files, which are built on a genome version older than the reference FASTA file I am using. How does this affect the quality and accuracy of the variants? How important is it to have exactly the same genome build as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use a version built on a different version of the genome for the same species? In other words, which steps (BQSR, HC, VQSR, etc.) can be performed without the known sites or resource files? If the answers to these questions are too detailed, can you please point me to any available document that addresses this issue?

Thanks, NB

Created 2015-02-12 07:12:06 | Updated | Tags: indelrealigner baserecalibrator mutect

Hello, do you recommend realigning around indels and recalibrating quality scores before running Mutect? Thanks!

Created 2015-01-30 16:05:13 | Updated | Tags: baserecalibrator

Hi, when I run java -jar /data/software/GATK/GenomeAnalysisTK.jar -I sample_1_marked.bam -R /data/scratch/mxiong/ref/NCBI_GRCh38/genome.fa -T BaseRecalibrator -knownSites /data/scratch/mxiong/ref/NCBI_GRCh38/All_20150114.vcf -o sample_1_BaseRecalibrator.table

I get the error:

INFO 09:44:18,529 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:44:18,531 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.6-5-gba531bd, Compiled 2013/07/18 18:05:31
INFO 09:44:18,531 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 09:44:18,531 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 09:44:18,534 HelpFormatter - Program Args: -I sample_1_marked.bam -R /data/scratch/mxiong/ref/NCBI_GRCh38/genome.fa -T BaseRecalibrator -knownSites /data/scratch/mxiong/ref/NCBI_GRCh38/All_20150114.vcf -o sample_1_BaseRecalibrator.table
INFO 09:44:18,534 HelpFormatter - Date/Time: 2015/01/30 09:44:18
INFO 09:44:18,534 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:44:18,535 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:44:18,544 ArgumentTypeDescriptor - Dynamically determined type of /data/scratch/mxiong/ref/NCBI_GRCh38/All_20150114.vcf to be VCF
INFO 09:44:18,586 GenomeAnalysisEngine - Strictness is SILENT
INFO 09:44:18,695 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 09:44:18,701 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 09:44:18,729 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 09:44:18,738 RMDTrackBuilder - Creating Tribble index in memory for file /data/scratch/mxiong/ref/NCBI_GRCh38/All_20150114.vcf
WARN 09:44:23,631 RestStorageService - Error Response: PUT '/GATK_Run_Reports/zVNtpWFPBehiPvx4z4rZjxdxJkkPnQ0v.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1354, Content-MD5: bjETF8CVdO+JCA4f863abA==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 6e311317c09574ef89080e1ff3adda6c, Date: Fri, 30 Jan 2015 15:44:23 GMT, Authorization: AWS AKIAIMHBU7X642TCHQ2A:jnCxXiM7CxJeA+5g1YqKx7Fkp9w=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-358.14.1.el6.x86_64; amd64; en; JVM 1.7.0_25), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 79AFC335EE127933, x-amz-id-2: V5djJVO8pbxaphF/VMxfYumCS+Z/lAkI5A5jNWoDrzMtCi7xLdCPifIxyBlfuUci+7OWttT98gg=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Fri, 30 Jan 2015 15:44:23 GMT, Connection: close, Server: AmazonS3]

##### ERROR ------------------------------------------------------------------------------------------

I used Genome assembly: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38 Human dbSNP Build 142 data: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606

Do those files fit GATK analysis directly?

Created 2015-01-21 19:53:26 | Updated | Tags: baserecalibrator haplotypecaller vcf bam merge rnaseq

Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when HaplotypeCaller analyzes each sample individually and will not be represented in the downstream VCF files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecalibrator and/or HaplotypeCaller simultaneously analyze multiple BAM files?

Is it common practice to combine bam files for discovering sequence variants?

Created 2015-01-18 04:58:43 | Updated | Tags: baserecalibrator

I am trying BQSR with different known SNP files. How do I know if one BQSR run is better than the other?

Is RMSE the right criterion, i.e. the smaller the better?

I am calculating RMSE by summing (Observations × Accuracy × Accuracy) for one category (e.g. QualityScore for Base Substitution) in the csv generated by AnalyzeCovariates, then dividing by the total Observations and taking the square root. Is this the right way to calculate RMSE?
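For concreteness, the calculation described above can be sketched as follows. This only restates the arithmetic in the question (an observation-weighted root mean square of the Accuracy column); it is not an official GATK metric, and the function name and the toy rows are made up for illustration.

```python
import math


def weighted_rmse(rows):
    """Observation-weighted RMSE.

    rows: iterable of (observations, accuracy) pairs taken from one
    covariate category of the csv (column meaning as described in the post).
    """
    total_observations = 0
    weighted_squares = 0.0
    for observations, accuracy in rows:
        total_observations += observations
        weighted_squares += observations * accuracy * accuracy
    if total_observations == 0:
        raise ValueError("no observations to average over")
    return math.sqrt(weighted_squares / total_observations)


# Toy rows (invented numbers): 100 observations off by +1.0, 300 off by -1.0.
print(weighted_rmse([(100, 1.0), (300, -1.0)]))
```

Because each squared accuracy is weighted by its observation count, bins with few observations contribute little to the total, which is worth keeping in mind when comparing runs with different numbers of observations.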

I noticed that different runs gave me different numbers of total Observations. Why is that?

What about Errors in the csv file? What does it mean? Can it be used to determine which BQSR run is better?

Created 2015-01-07 18:29:56 | Updated | Tags: baserecalibrator

Hello, I am trying to use BaseRecalibrator and I keep getting the error below. I am using GATK v3.3-0 and java 1.8.0_25. Here is the command line:

java -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -R genome.fa -knownSites dbsnp_138.hg19.vcf -I SP033.marked.realigned.fixed.bam -T BaseRecalibrator -o SP033.recal_data.csv -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate

I also tried to run it without specifying the -cov parameters, but I get the same error. What can I do to fix this?

INFO 13:29:07,885 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:29:07,887 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 13:29:07,887 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:29:07,887 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:29:07,891 HelpFormatter - Program Args: -R genome.fa -knownSites dbsnp_138.hg19.vcf -I SP033.marked.realigned.fixed.bam -T BaseRecalibrator -o SP033.recal_data.csv -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate
INFO 13:29:07,903 HelpFormatter - Executing as Alessandro@alessandrosimac.wireless.mountsinai.org on Mac OS X 10.10.1 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17.
INFO 13:29:07,903 HelpFormatter - Date/Time: 2015/01/07 13:29:07
INFO 13:29:07,903 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:29:07,904 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:29:08,246 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:29:08,312 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 13:29:08,318 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
WARNING: BAM index file /Users/Alessandro/Research/Work/multiple myeloma/MSSM-WES/SP033.marked.realigned.fixed.bai is older than BAM /Users/Alessandro/Research/Work/multiple myeloma/MSSM-WES/SP033.marked.realigned.fixed.bam
INFO 13:29:08,348 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 13:29:08,516 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 13:29:08,519 GenomeAnalysisEngine - Done preparing for traversal
INFO 13:29:08,520 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:29:08,520 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 13:29:08,520 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime
INFO 13:29:08,611 BaseRecalibrator - The covariates being used here:
INFO 13:29:08,611 BaseRecalibrator - ReadGroupCovariate
INFO 13:29:08,612 BaseRecalibrator - QualityScoreCovariate
INFO 13:29:08,612 BaseRecalibrator - ContextCovariate
INFO 13:29:08,612 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 13:29:08,612 BaseRecalibrator - CycleCovariate
INFO 13:29:08,614 ReadShardBalancer$1 - Loading BAM index data
INFO 13:29:08,615 ReadShardBalancer$1 - Done loading BAM index data
INFO 13:29:09,991 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -7
at org.broadinstitute.gatk.utils.baq.BAQ.calcEpsilon(BAQ.java:185)
at org.broadinstitute.gatk.utils.baq.BAQ.hmm_glocal(BAQ.java:272)
at org.broadinstitute.gatk.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:553)
at org.broadinstitute.gatk.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:610)
at org.broadinstitute.gatk.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:536)
at org.broadinstitute.gatk.utils.baq.BAQ.baqRead(BAQ.java:680)
at org.broadinstitute.gatk.tools.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:486)
at org.broadinstitute.gatk.tools.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:262)
at org.broadinstitute.gatk.tools.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:135)
at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228)
at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102)
at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-01-06 18:01:02 | Updated | Tags: baserecalibrator errorthrowing gatk-runtime-error

Hello,

I keep getting an uninformative error message when I run BaseRecalibrator. I did add these parameters to make it get this far: --fix_misencoded_quality_scores -fixMisencodedQuals --filter_mismatching_base_and_quals. But it seems that I have reached a dead end.

Here is the command I run: java -Xmx2g -jar /usr/local/bin/gatk-2.3/GenomeAnalysisTKLite.jar -T BaseRecalibrator -R /home/ubuntu/db/hg19/ucsc.hg19.fasta -I S14694.bwa.clean.sort.dup.grp.reorder.realign.bam -o S14694.bwa.clean.sort.dup.grp.reorder.realign.recalibration_report.grp -knownSites /home/ubuntu/db/hg19/dbsnp.hg19.b141.vcf -DIQ -nct 3 --logging_level INFO --validation_strictness LENIENT --fix_misencoded_quality_scores -fixMisencodedQuals --filter_mismatching_base_and_quals

Here is the error message:

##### ERROR ------------------------------------------------------------------------------------------

Current error:

INFO 13:49:09,583 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 13:49:09,583 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:49:09,583 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:49:09,588 HelpFormatter - Program Args: --log_to_file /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.log --performanceLog /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.perflog --keep_program_records -T BaseRecalibrator -R /home/mano/media/NAS1/PFlab/Mano/Genome/reference.fa -I /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.fixed.bam -o /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.after_recal_data.table --knownSites /home/mano/media/NAS1/PFlab/Mano/Fwithoutchrsorted.vcf --knownSites /home/mano/media/NAS1/PFlab/Mano/SNPs_only_withoutCHR_sort.vcf --deletions_default_quality 45 --insertions_default_quality 45 --low_quality_tail 2 --solid_nocall_strategy LEAVE_READ_UNRECALIBRATED --solid_recal_mode SET_Q_ZERO --lowMemoryMode --no_standard_covs --bqsrBAQGapOpenPenalty 30.0
INFO 13:49:09,591 HelpFormatter - Executing as mano@balrog04 on Linux 3.2.0-26-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO 13:49:09,592 HelpFormatter - Date/Time: 2014/06/02 13:49:09
INFO 13:49:09,592 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:49:09,592 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:49:10,461 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:49:10,538 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 13:49:10,547 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:49:10,574 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 13:50:17,644 RMDTrackBuilder - Writing Tribble index to disk for file /home/mano/media/NAS1/PFlab/Mano/SNPs_only_withoutCHR_sort.vcf.idx
INFO 13:50:29,885 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

Please note that the .idx file is created, but it is only a few kilobytes in size.

Created 2014-05-28 00:42:16 | Updated | Tags: baserecalibrator output

I've been running BaseRecalibrator for a while, and I've just realized that I have an empty file with the same name I've given to BaseRecalibrator as output (it may have been created from a previous aborted run). Will the tool write to this file when it is finished (ETA ~19 hours), or will it exit with an error? Also, are there any intermediate files created in generating the recalibration table, and if so, where should I look for them (output directory, log directory, directory from which I called GATK, ...)?

Created 2014-05-21 20:18:11 | Updated | Tags: baserecalibrator extremely-high-quality-score

-allowPotentiallyMisencodedQuals works well with IndelRealigner. But when I use it with BaseRecalibrator, it still complains about extremely high quality scores.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/data004/GIF_1/GIF_1/lwang/./bam/RIMMA0468_Andean.IndelRealigned.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (68) with BAQ correction factor of 4; please see the GATK --help documentation for options related to this error


Any suggestions? GATK version 3.1 is used here. Thanks!
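As background to this kind of error: base qualities are commonly written as ASCII characters with an offset of 33 (Sanger, Illumina 1.8+) or 64 (older Illumina pipelines). Reading Phred+64 data as if it were Phred+33 inflates every score by 31. The sketch below illustrates only that arithmetic. The helper names are hypothetical, and the cutoff of 60 is an assumption chosen for illustration, not necessarily the threshold GATK applies.

```python
def decode_quals(qual_string, offset):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in qual_string]


def looks_misencoded(qual_string, max_reasonable_q=60):
    """Heuristic: True if decoding with offset 33 yields an implausibly
    high score. The default cutoff is an illustrative assumption."""
    return any(q > max_reasonable_q for q in decode_quals(qual_string, 33))


quals = "hgfB"  # typical-looking Phred+64 characters
print(decode_quals(quals, 64))  # decoded with the matching offset
print(decode_quals(quals, 33))  # same characters, wrong offset: +31 each
print(looks_misencoded(quals))
```

A string that trips this heuristic is a candidate for conversion to Phred+33 before recalibration, but the check cannot by itself prove which encoding was used.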

Created 2014-05-16 21:22:48 | Updated | Tags: baserecalibrator vcf realignment

Hello

I am following the GATK Best Practices guide and using the two-stage BaseRecalibrator process. The first recalibration goes well, but the second one produces the error described below. I find this strange, since neither the documentation nor the tutorials mention that BaseRecalibrator requires a VCF input.

Here is my command

java -Xmx2g -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R GRCh37-lite.fa \
-I SA495-Tumor.sorted.realigned.bam \
-BQSR SA495-Tumor.sorted.realigned.grp \
-o SA495-Tumor.sorted.post_recal.grp2

here is the error message

INFO 14:03:38,425 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.

Any idea?

Thanks

Created 2014-05-14 15:17:29 | Updated 2014-05-14 15:19:08 | Tags: bqsr baserecalibrator printreads

Hello GATK team,

I have a question regarding the PrintReads walker. I am running it with the --BQSR engine using the command below. While my job has yet to finish, it is clear that the output bam file will be significantly larger than the input (~2x). I have not set the –emit_original_quals flag, so I expect that the original scores should be discarded. Should I still be expecting this size increase?

    java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/applyRecal_${sample} \
    -jar ${gatk} \

##### ERROR MESSAGE: Code exception (see stack trace for error itself)

Created 2013-12-16 11:41:23 | Updated | Tags: baserecalibrator bam bwa

We used bwa 0.7.4 aln and sampe to align Illumina reads, then ran the following command:

java -Xmx6g -jar ~/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T BaseRecalibrator -I ~/temp/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedindelrealigned.bam -R ~/hg19/ucsc.hg19.fasta -knownSites ~/dbSNP/dbsnp_137.hg19.vcf -o ~/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedBQSR.grp

which gave the following error message:

##### ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D)
at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:537)
at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:193)
at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:389)
at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:392)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:245)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

##### ERROR MESSAGE: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D)

Can you help me with this error message? Why does it occur, and how can I fix it? Thanks in advance, Mayukh
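To help inspect the read named in the message, the CIGAR string can be broken down by hand. The sketch below uses the standard SAM rule for which operations consume read bases (M, I, S, =, X) and which consume reference bases (M, D, N, =, X); the helper is hypothetical and not part of GATK. Note that this CIGAR ends with a deletion, which is unusual; whether that is what triggers the exception is an assumption rather than something the message confirms.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def cigar_lengths(cigar):
    """Return (read_length, reference_span) implied by a CIGAR string."""
    read_length = 0
    reference_span = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in CONSUMES_READ:
            read_length += n
        if op in CONSUMES_REFERENCE:
            reference_span += n
    return read_length, reference_span


def last_operation(cigar):
    """Return the final CIGAR operation letter."""
    return CIGAR_OP.findall(cigar)[-1][1]


cigar = "85M4I1M2D"  # CIGAR quoted in the error message
print(cigar_lengths(cigar))
print(last_operation(cigar))
```

For the quoted CIGAR the read length works out to 85 + 4 + 1 = 90 bases, matching the "90b aligned read" in the message, while the reference span is 85 + 1 + 2 = 88 bases.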

Created 2013-11-24 09:04:01 | Updated | Tags: baserecalibrator

Hi, I ran into errors in my recalibration process. Every time an error occurred before, the message said "don't post the errors", and I could then find a solution in the FAQ or forum. This time, however, it tells me to post the error. I tried to find a solution but failed. I am new to GATK and know little about Java.

here is the command:

nohup java -Xmx50g -Djava.io.tmpdir=/tmp\
-jar GATKdir/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R ~/data/ReferAll/sortedIndex/ucsc.hg19.sorted.fasta \
-I ~/projects/TWINS/WGC007813/WGC007813.sai.sam.dedup.realigned.bam \
-knownSites ~/data/ReferAll/bundle/dbsnp_137.hg19.sorted.vcf \
-knownSites ~/data/ReferAll/bundle/1000G_phase1.indels.hg19.sorted.vcf \
-knownSites ~/data/ReferAll/bundle/Mills_and_1000G_gold_standard.indels.hg19.sorted.vcf \
-o ~/projects/TWINS/WGC007813/WGC007813.sai.sam.dedup.realigned.bam.recal.grp \
~/projects/TWINS/WGC007813/no.BaseRecalibrator.out &

And this is the error message!

Created 2013-11-22 15:07:27 | Updated | Tags: unifiedgenotyper baserecalibrator

Dear Team,

We are running an exome sequencing project where we have sequenced samples on two different lanes. Aligning with bwa, we assign identical ID and SM tags, yet a different PU tag, for these files. This should be in line with your general recommendations, keeping the lane information available for recalibration purposes. We have then fed both files from the same sample into the base recalibration step to create a common, sample-level BAM, also in accordance with recommendations previously posted on the forum. However, when calling variants we get VCFs where UnifiedGenotyper has treated the different lanes as separate samples. What are we doing wrong? Is this approach not possible after all, so that an identical read group is required for each sample?

Created 2013-11-12 16:16:45 | Updated | Tags: baserecalibrator

I want to create plots before and after recalibration, and I am getting the error below. I have checked for the "ggplot2" package, which is required for generating graphs, and also added the path of Rscript to my environment, which is confirmed by: which Rscript

/nfs/apps/R/2.15.1/bin/Rscript

##### ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.

Below is the command I am running:

/shares/jre1.7.0_40/bin/java -jar /shares/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T AnalyzeCovariates -R /shares/dbdata/human_g1k_v37.fasta -before /shares/bam_base_recalib/recal_data.table -after /shares/bam_base_recalib/post_recal_data.table -plots /shares/bam_base_recalib/BQSR.pdf

Created 2013-11-08 10:33:48 | Updated | Tags: baserecalibrator depthcoverage

Hi,

I am using GATK through the Galaxy main server to analyze variation from whole-genome re-sequencing of various samples of non-model species (nematode worms). I would like to know whether Galaxy's GATK tools can produce a kind of pileup of the genome (per base or per interval, like a .bed file) indicating specifically which bases were callable or not by UnifiedGenotyper (UG), as "CallableLoci" does. The log and metrics files generated by UG in Galaxy give general statistics on callable loci, but no file gives detailed information on the eligibility of each base.

Along the same lines, I would like to get a per-locus depth of coverage (which can partially help answer my previous question, although it does not take into account all the filters used by UG, such as base quality, mapping quality, etc.). This tool is available on Galaxy. However, I am performing 3 rounds of BQSR to get my final VCF file. Should I calculate the depth of coverage using the first BAM file before BQSR or the last recalibrated BAM file obtained in the 3rd round of BQSR? I don't think BQSR alters the coverage, so I would say this shouldn't matter. Am I right?

Created 2013-11-01 17:56:45 | Updated | Tags: bqsr baserecalibrator

I have a study including 100 samples. While most samples were sequenced in one lane, a dozen were in two lanes. I wonder if the difference in # of lanes may lead to difference in overall base quality scores after BQSR?

Created 2013-10-25 18:32:30 | Updated | Tags: baserecalibrator variantstovcf format

I am trying to format files for input into the BaseRecalibrator and VariantsToVCF tools. Many of the links for file formats listed on the 'variant' option do not work (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_VariantsToVCF.html#--variant). I get the following message: Not Found

Do you know what the problem is?

Created 2013-10-16 21:24:18 | Updated 2013-10-16 21:26:40 | Tags: baserecalibrator printreadswalker

I ran BaseRecalibrator on some of my realigned BAM files. No errors were reported while running it, and I was able to generate the "recal.grp" files. However, while running the PrintReads walker to generate the recalibrated BAM files, I get the following error message for some of the BAMs.

##### ERROR MESSAGE: Exception when processing alignment for BAM index WTCHG_35305_101:4:1105:10332:129428#TGTTAACT 2/2 100b aligned read.

So, I tried to validate my realigned BAM as well as the original BAM before realignment using Picard's ValidateSamFile, and I get the following:

original BAM: Mate unmapped flag does not match read unmapped flag of mate, Mate alignment does not match alignment start of mate, Mate negative strand flag does not match read negative strand flag of mate

Additionally, these errors on realigned BAMs: Mate reference index (MRNM) does not match reference index of mate, Mate not found for paired read

Do I need to worry about the initial alignments? I read on the forums that using -rf MateSameStrand Filter should help me work around this; what exactly does this filter do? Any other approaches to solving this problem would be appreciated.
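Two of the validation messages quoted above concern bits in the SAM FLAG field: 0x4 (read unmapped), 0x8 (mate unmapped), 0x10 (read on reverse strand) and 0x20 (mate on reverse strand). The sketch below shows what a consistency check between the two records of a pair amounts to. It is a hypothetical illustration of the idea, not Picard's implementation, and it reuses the wording of the messages only for readability.

```python
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
READ_REVERSE = 0x10
MATE_REVERSE = 0x20


def one_way_problems(flag, mate_flag):
    """Compare what `flag` claims about its mate with what the mate's
    own flag says about itself."""
    problems = []
    if bool(flag & MATE_UNMAPPED) != bool(mate_flag & READ_UNMAPPED):
        problems.append("mate unmapped flag does not match read unmapped flag of mate")
    if bool(flag & MATE_REVERSE) != bool(mate_flag & READ_REVERSE):
        problems.append("mate negative strand flag does not match read negative strand flag of mate")
    return problems


def mate_flag_problems(flag_a, flag_b):
    """Check the pair in both directions and return all inconsistencies."""
    return one_way_problems(flag_a, flag_b) + one_way_problems(flag_b, flag_a)


print(mate_flag_problems(99, 147))  # a consistent, properly paired example
print(mate_flag_problems(99, 163))  # both records claim the mate is reversed
```

Inconsistencies of this kind usually mean the mate fields were not updated after a step that changed one read of the pair, which is why tools that rewrite mate information are often suggested before recalibration.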

Created 2013-10-09 06:31:14 | Updated | Tags: baserecalibrator

After fixing mate information (with Picard) in my BAM file, I ran base recalibration on the new BAM file. Its size jumped from 101 to 173 Gb. Is this due to the added BD and BI tags? What is the meaning of these tags? Are they used in variant calling?

Created 2013-08-26 12:52:23 | Updated | Tags: baserecalibrator error buffer

Hi, I get an error at the BaseRecalibrator step. Even after increasing the memory allocated to the job, I still get the same error, and I found nothing in previously published posts:

##### ERROR ------------------------------------------------------------------------------------------

Created 2013-02-11 21:52:27 | Updated 2013-02-11 22:08:09 | Tags: baserecalibrator

Hi, I am trying to run the base recalibrator. For some of my sequences it works perfectly, but for others it always crashes with the following error message:

INFO 16:49:21,718 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:49:21,720 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-6-gebbba25, Compiled 2013/01/08 19:29:18
INFO 16:49:21,720 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:49:21,720 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:49:21,724 HelpFormatter - Program Args: -T BaseRecalibrator -I alnSortedNoDupRGRealigned1.bam -l INFO -R 1.198.A.contig22.fa -knownSites alnSortedNoDupRGRealigned1_t0.01.vcf -o test.rcl -fixMisencodedQuals --filter_mismatching_base_and_quals
INFO 16:49:21,725 HelpFormatter - Date/Time: 2013/02/11 16:49:21
INFO 16:49:21,725 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:49:21,725 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:49:21,737 ArgumentTypeDescriptor - Dynamically determined type of alnSortedNoDupRGRealigned1_t0.01.vcf to be VCF
INFO 16:49:21,744 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:49:21,970 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 16:49:21,977 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:49:21,992 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:49:22,005 RMDTrackBuilder - Loading Tribble index from disk for file alnSortedNoDupRGRealigned1_t0.01.vcf
INFO 16:49:22,040 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:49:22,040 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 16:49:22,134 BaseRecalibrator - The covariates being used here:
INFO 16:49:22,134 BaseRecalibrator - ReadGroupCovariate
INFO 16:49:22,134 BaseRecalibrator - QualityScoreCovariate
INFO 16:49:22,134 BaseRecalibrator - ContextCovariate
INFO 16:49:22,135 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 16:49:22,135 BaseRecalibrator - CycleCovariate
INFO 16:49:22,137 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO 16:49:22,138 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,139 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,139 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO 16:49:22,139 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO 16:49:22,140 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,140 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1002, 3]
INFO 16:49:22,140 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO 16:49:22,141 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO 16:49:22,145 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 16:49:22,147 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO 16:49:24,487 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -31
at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158)
at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:225)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530)
at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

##### ERROR ------------------------------------------------------------------------------------------

Here is the command line I am using:

java -Xmx4g -jar /mit/sjlabrie/software/GenomeAnalysisTK-2.3-6-gebbba25/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I alnSortedNoDupRGRealigned1.bam \
-l INFO \
-R 1.198.A.contig22.fa \
-knownSites alnSortedNoDupRGRealigned1_t0.01.vcf \
-o test.rcl \
-fixMisencodedQuals \
--filter_mismatching_base_and_quals \

Any idea what's going on?

Thank you,

Simon

Created 2013-02-01 10:44:13 | Updated 2013-02-04 14:38:40 | Tags: baserecalibrator bug arrayindexoutofboundsexception

Hi,

I searched the site and it seems several people have had a similar problem, but I couldn't fix mine based on those threads. I've managed to get to this point successfully, creating a realigned BAM file with IndelRealigner. I'm using a non-model organism with a self-created masking list from a VCF run done before the realignment (list in BED format). I've had to use -fixMisencodedQuals throughout my runs so far. I checked the masking file so that the ranges do not extend beyond the ranges specified in my list of regions to cover (-L option).

Is there anything else I would need to check still?

Command line view:

sulyba@hippu4:/fs/lustre/wrk/sulyba/stickleback_capture> java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals

INFO  12:34:10,502 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,510 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14
INFO  12:34:10,510 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  12:34:10,516 HelpFormatter - Program Args: -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals
INFO  12:34:10,516 HelpFormatter - Date/Time: 2013/02/01 12:34:10
INFO  12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,551 ArgumentTypeDescriptor - Dynamically determined type of Sites_to_Mask.bed to be BED
INFO  12:34:10,563 GenomeAnalysisEngine - Strictness is SILENT
INFO  12:34:11,309 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  12:34:11,319 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:34:11,392 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
INFO  12:34:11,660 GenomeAnalysisEngine - Processing 20350670 bp from intervals
INFO  12:34:11,672 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  12:34:11,946 BaseRecalibrator - The covariates being used here:
INFO  12:34:11,947 BaseRecalibrator -   QualityScoreCovariate
INFO  12:34:11,947 BaseRecalibrator -   ContextCovariate
INFO  12:34:11,947 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3
INFO  12:34:11,947 BaseRecalibrator -   CycleCovariate
INFO  12:34:11,952 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO  12:34:11,952 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,952 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO  12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO  12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1002, 3]
INFO  12:34:11,954 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,954 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:12,109 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  12:34:12,120 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO  12:34:12,644 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  12:34:12,645 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO  12:34:14,901 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -2
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: -2
##### ERROR ------------------------------------------------------------------------------------------


Created 2013-01-29 21:22:05 | Updated | Tags: baserecalibrator bug

Dear GATK Team,

I am running a pipeline on several high-coverage human individuals that have been mapped using bwa and processed using samtools, Picard and GATK. The BAM files pass ValidateSam from Picard, but when I run the BQSR step some of them fail with a Malformed read error (using -filterMBQ does not help in this case). I tracked the error down to BAM files that end with a paired-end read whose mate maps at the beginning of the contig (in my case human mtDNA).

E.g., this will make it crash:

readX 177 MT 16558 37 7S2M2I10M80S = 294 -16176 GACCTGTGATCC...
readY 177 MT 16558 37 7S2M2I10M80S = 238 -16232 GACCTGTGATCC...
readZ 113 MT 16558 37 7S2M2I10M80S = 273 -16197 GACCTGTGATCC...
[END]

whereas a file ending like this won't crash:

readX 83 MT 16469 60 101M = 16246 -324 TGGGGGTAGCTAAAGTGAAC...
readY 147 MT 16469 60 101M = 16267 -303 TGGGGGTAGCTAAAGTGA...
readZ 147 MT 16469 60 101M = 16193 -377 TGGGGGTAGCTAAAGTGAAC...
[END]
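
One visible difference between the two file endings is that the crashing reads carry an 80-base trailing soft clip at the very end of the contig. The sketch below is a hypothetical check for that pattern, not something taken from GATK itself: it parses a CIGAR string and reports whether the read, including its soft-clipped bases, would extend past the contig. The helper names are made up for illustration, and the MT length of 16569 bp is an assumption (use the length from your own reference's sequence dictionary). Whether this overhang is what actually triggers the error is not confirmed here.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")


def clipped_span(pos, cigar):
    """Return the 1-based inclusive (start, end) covered by the read on the
    reference when leading and trailing soft clips are counted as well."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    aligned_end = pos + ref_len - 1
    # Hard clips can sit outside soft clips, so ignore them here.
    core = [(n, op) for n, op in ops if op != "H"]
    lead = core[0][0] if core and core[0][1] == "S" else 0
    trail = core[-1][0] if core and core[-1][1] == "S" else 0
    return pos - lead, aligned_end + trail


def overhangs_contig(pos, cigar, contig_length):
    """True if the soft-clipped read would start before base 1 or end
    after the last base of the contig."""
    start, end = clipped_span(pos, cigar)
    return start < 1 or end > contig_length


MT_LENGTH = 16569  # assumption: length of the MT contig in the reference used

print(overhangs_contig(16558, "7S2M2I10M80S", MT_LENGTH))  # crashing example
print(overhangs_contig(16469, "101M", MT_LENGTH))          # non-crashing example
```

With the assumed contig length, the first call reports an overhang and the second does not, which matches the split between the crashing and non-crashing examples above.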

I am running GATK v2.3-9-ge5ebf34, but the same error occurs using GATK v-2.2-3 (my previous version). I can genotype the files using UnifiedGenotyper without any problem as well.

This is the error:

##### ERROR stack trace
java.lang.NullPointerException
at org.broadinstitute.sting.utils.Utils.join(Utils.java:286)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.writeCSV(RecalUtils.java:450)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.generateRecalibrationPlot(RecalUtils.java:394)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.generatePlots(BaseRecalibrator.java:474)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:464)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:97)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

It looks like the csv file is not being produced.

Thanks!

Created 2013-01-23 14:58:26 | Updated | Tags: baserecalibrator solid lifescope

Does GATK BaseRecalibrator work with BAM files produced with the SOLiD Lifescope mapper? Your base quality recalibration presentation shows that recalibration should also work on SOLiD data, but it doesn't mention whether it also works for BAM files produced with Lifescope. BWA mapping quality ranges from 0 to 37; Lifescope mapping quality ranges from 0 to 95. I get an ArrayIndexOutOfBoundsException on the Lifescope BAM files.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -92
at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158)
at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:225)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530)
at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ----------------------------------------------

Created 2013-01-21 09:34:05 | Updated 2013-01-22 17:15:48 | Tags: baserecalibrator arrayindexoutofboundsexception

Hello, I am a new user of GATK; until now I have found the answers to my questions on your forum. The following command:

java -jar ../GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -I 9485_realignedBam.fixed.bam -R hg19.fa -knownSites dbsnp_132_hg19.vcf -o recal_data.grp -filterMBQ

gives me this error message:

##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -3
at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158)
at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:246)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595)
at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530)
at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-6-gebbba25)
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: -3
##### ERROR --------------------

I cannot find this error on the forum. Can you explain it and help me? Thank you.

Created 2012-12-07 12:53:34 | Updated | Tags: baserecalibrator pdf

I have used BaseRecalibrator with the -plots option, but no PDF is produced. Is there some other software that is required for this?

Created 2012-12-04 19:37:24 | Updated | Tags: baserecalibrator

I'm attempting to use the BaseRecalibrator tool for 30-50x depth whole genome datasets with BAM files of around 100-150 GB. However, it is very computationally demanding, so I'd really like to distribute the processing over many cores on our cluster. I've done this for the indel realignment process by running each chromosome separately, as described in the now retired guidelines on "Parallelism with the GATK" (I think a new version is due to be issued at some point).

It's less clear, to me at least, how to do this for the BaseRecalibrator. For example, is it possible to combine GATKReports for the recalibration data generated for separate chromosomes? Or should I run the on-the-fly recalibration with PrintReads and the -BQSR option using the recalibration data for each chromosome separately? If the latter, does it matter that for some of the smaller unplaced/unlocalized chromosomes the recalibration tables will contain values for covariates generated with only a few observations? The documentation on the Base Quality Score Recalibrator seems to suggest that the recalibration tables need to be calculated over the whole genome.

Thanks, Matt

Created 2012-11-22 10:26:18 | Updated | Tags: baserecalibrator error

Hi, I got this error today running BaseRecalibrator:

##### ERROR stack trace
java.lang.NullPointerException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.hasQueuedPredecessors(AbstractQueuedSynchronizer.java:1453)
at java.util.concurrent.locks.ReentrantLock$FairSync.tryAcquire(ReentrantLock.java:240)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1158)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at java.util.concurrent.PriorityBlockingQueue.take(PriorityBlockingQueue.java:244)
at org.broadinstitute.sting.utils.nanoScheduler.Reducer.reduceAsMuchAsPossible(Reducer.java:121)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:510)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)

The command arguments I used are: -nct 4 -T BaseRecalibrator --intermediate_csv_file inter.csv -I realigned.bam -R Homo_sapiens.GRCh37.68.dna.chromosome.all.fasta -o recal_data.grp --plot_pdf_file recal.pdf -knownSites dbsnp_137.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -knownSites 1000G_phase1.indels.b37.vcf --disable_indel_quals

This command has previously worked with other data using the same version of GATK.

Created 2012-11-21 17:13:54 | Updated | Tags: baserecalibrator

What is the criterion (or criteria) for applying the Yates correction to the empirical base qualities in the base quality recalibration?

Thanks!

Created 2012-11-16 22:13:42 | Updated | Tags: baserecalibrator

Hello dear GATK People,

BaseRecalibrator is failing for me with the new GATK version; my pipeline worked with 2.1-11. My error message is below. Is there a quick fix, or should I stick to the old version?

Ania

##### ERROR stack trace

java.lang.IllegalArgumentException: fromIndex(402) > toIndex(101)
at java.util.Arrays.rangeCheck(Unknown Source)
at java.util.Arrays.fill(Unknown Source)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateKnownSites(BaseRecalibrator.java:280)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateSkipArray(BaseRecalibrator.java:259)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:239)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:287)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:252)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

.....

##### ERROR ------------------------------------------------------------------------------------------

Created 2012-11-06 20:23:45 | Updated 2012-11-06 20:27:53 | Tags: baserecalibrator malformedreadfilter

Here's what I'm running:

INFO  12:18:33,096 HelpFormatter - Program Args: -T BaseRecalibrator -I /home/sheenams/gatk_test/gatk2.2/H103.GATKinitialrmdup.srt.bam -R /home/genetics/Genomes/gatk-bundle/human_g1k_v37.fasta -knownSites /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o /home/sheenams/gatk_test/gatk2.2/H103.recal_data.csv -log /home/sheenams/gatk_test/gatk2.2/H103.gatk_log


Here's the error I'm getting:

INFO  12:18:33,309 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  12:18:33,353 BaseRecalibrator - The covariates being used here:
INFO  12:18:33,354 BaseRecalibrator -   QualityScoreCovariate
INFO  12:18:33,354 BaseRecalibrator -   ContextCovariate
INFO  12:18:33,354 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3
INFO  12:18:33,354 BaseRecalibrator -   CycleCovariate
INFO  12:18:33,355 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO  12:18:33,355 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,355 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 2002, 3]
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:36,198 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Array length mismatch detected. Malformed read?
##### ERROR ------------------------------------------------------------------------------------------


I used Picard's ValidateSam script on my BAM file, but it says it's fine. How do I fix this error?

Thanks

Created 2012-11-01 21:03:46 | Updated 2012-11-05 20:20:30 | Tags: baserecalibrator

Hi, I got the following error with the BaseRecalibrator in GenomeAnalysisTK-2.2-2-gf44cc4e.

##### ERROR MESSAGE: Key 2006 is too large for dimension 2 (max is 2001)


I also ran Picard's ValidateSamFile to validate my BAM file and it reports no errors. What exactly does this error mean? What key is it talking about? And how can I fix it? Thanks, Ashu

Created 2012-10-31 16:19:49 | Updated 2012-10-31 17:32:50 | Tags: baserecalibrator dataprocessingpipeline rscript

Hi there, I was trying to debug an error in the RScript generated after base recalibration, while running the DataProcessingPipeline.scala (run as it is). I get the following debug output

 [...]
Error in file(filename, "r", blocking = TRUE) :
cannot open the connection
Calls: source ... eval.with.vis -> eval.with.vis -> gsa.read.gatkreport -> file
1: In file(filename, "r", blocking = TRUE) :
cannot open file '/SAN/scratch3/sample378_TTAGGC_L004_R1_001.fastq.pre_recal.table.recal': No such file or directory
Execution halted


No file ending with "recal.table.recal" exists, but the file "recal.table" does exist. I couldn't find any step in the Scala script where ".recal" is appended to "recal.table", nor a specific trait or class referring to the RScript itself; as I understand it, that is part of the BaseRecalibrator walker.

Is this a small bug in the name handling, or am I doing something wrong somewhere?

Thanks, Francesco

Created 2012-10-31 11:29:09 | Updated | Tags: baserecalibrator

Hi,

I am working on bovine exome sequencing datasets. I have 22 animals (the read length of 11 samples is 90 bp; the others are 100 bp). I merged all of them into one big BAM file and did indel realignment on it. When I ran BaseRecalibrator, my job was aborted on chr12. I checked the region chr12(28982297, 29984297) of my reference file with Samtools, and it does not seem to be damaged. Any suggestions?

Wanbo

##### ERROR MESSAGE: Unable to load chr12(28982297, 29984297) from /data/Wanbo/genomes/bosTau6.fasta

Created 2012-10-31 01:51:31 | Updated 2012-10-31 22:20:05 | Tags: baserecalibrator commandlinegatk phone-home

Hi, when I run BaseRecalibrator with the following command:

java -Xmx4g -jar /usr/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -I realignedBam.bam  -R /data1/human_g1k_v37.fasta --knownSites /data1/snp132.vcf -o recalibration_report.grp


I get the following error :

INFO  07:15:53,380 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO  07:15:53,380 HttpMethodDirector - Retrying request
INFO  07:15:53,386 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO  07:15:53,387 HttpMethodDirector - Retrying request
INFO  07:15:53,393 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO  07:15:53,393 HttpMethodDirector - Retrying request
INFO  07:15:53,398 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO  07:15:53,398 HttpMethodDirector - Retrying request
INFO  07:15:53,405 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: Unrecognized SSL message, plaintext connection?
INFO  07:15:53,405 HttpMethodDirector - Retrying request
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-34-g07bda93):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR
##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
##### ERROR          Name        FeatureType   Documentation
##### ERROR ------------------------------------------------------------------------------------------


Created 2012-10-17 22:03:44 | Updated 2012-10-17 22:09:50 | Tags: baserecalibrator

I'm trying to run the BaseRecalibrator tool on my data and am getting the following error:

INFO 14:58:17,399 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:58:17,400 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-13-g1706365, Compiled 2012/10/12 19:21:06
INFO 14:58:17,400 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:58:17,400 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:58:17,401 HelpFormatter - Program Args: -T BaseRecalibrator -I /home/sheenams/gatk_test/LMG-206.GATKinitialrmdup.srt.bam -R /home/genetics/Genomes/gatk-bundle/human_g1k_v37.fasta -knownSites /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf -knownSites /home/genetics/Genomes/gatk-bundle/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -knownSites /home/genetics/Genomes/gatk-bundle/1000G_phase1.indels.b37.vcf -o /home/sheenams/gatk_test/LMG-206.recal_data.csv -log /home/sheenams/gatk_test/LMG-206.gatk_log
INFO 14:58:17,401 HelpFormatter - Date/Time: 2012/10/17 14:58:17
INFO 14:58:17,401 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:58:17,401 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:58:17,407 ArgumentTypeDescriptor - Dynamically determined type of /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf to be VCF
INFO 14:58:17,409 ArgumentTypeDescriptor - Dynamically determined type of /home/genetics/Genomes/gatk-bundle/Mills_and_1000G_gold_standard.indels.b37.sites.vcf to be VCF
INFO 14:58:17,410 ArgumentTypeDescriptor - Dynamically determined type of /home/genetics/Genomes/gatk-bundle/1000G_phase1.indels.b37.vcf to be VCF
INFO 14:58:17,414 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:58:17,463 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:58:17,479 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 14:58:17,487 RMDTrackBuilder - Loading Tribble index from disk for file /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf
WARN 14:58:17,574 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A
INFO 14:58:17,575 RMDTrackBuilder - Loading Tribble index from disk for file /home/genetics/Genomes/gatk-bundle/Mills_and_1000G_gold_standard.indels.b37.sites.vcf
WARN 14:58:17,589 VCFStandardHeaderLines$Standards - Repairing standard header line for field GQ because -- type disagree; header has Float but standard is Integer
INFO 14:58:17,590 RMDTrackBuilder - Loading Tribble index from disk for file /home/genetics/Genomes/gatk-bundle/1000G_phase1.indels.b37.vcf
WARN 14:58:17,603 VCFHeader - Found GL format, but no PL field. As the GATK now only manages PL fields internally automatically adding a corresponding PL field to your VCF header
WARN 14:58:17,603 VCFStandardHeaderLines$Standards - Repairing standard header line for field AC because -- count types disagree; header has UNBOUNDED but standard is A -- descriptions disagree; header has 'Alternate Allele Count' but standard is 'Allele count in genotypes, for each ALT allele, in the same order as listed'
WARN 14:58:17,603 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has INTEGER but standard is A -- descriptions disagree; header has 'Global Allele Frequency based on AC/AN' but standard is 'Allele Frequency, for each ALT allele, in the same order as listed'
INFO 14:58:18,093 BaseRecalibrator - The covariates being used here:
INFO 14:58:18,093 BaseRecalibrator - ReadGroupCovariate
INFO 14:58:18,093 BaseRecalibrator - QualityScoreCovariate
INFO 14:58:18,094 BaseRecalibrator - ContextCovariate
INFO 14:58:18,094 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 14:58:18,094 BaseRecalibrator - CycleCovariate
INFO 14:58:18,136 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 14:58:18,137 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 14:58:35,886 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

I didn't see any other questions in the forum that addressed this. Can you please guide me on how to fix this error? I'm running GATK 2.1-13.

Thanks,

Sheena

Created 2012-10-06 18:59:15 | Updated 2013-01-07 19:49:26 | Tags: bqsr baserecalibrator printreads

What are the BD and BI flags that get added to my BAM files after base recalibration? They seem to consist of a long string of "N"s, and I'm trying to understand whether that is correct.
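For context, here is a minimal sketch of how such a tag string could be decoded. It assumes the BD and BI values use the same Phred+33 ASCII encoding as the QUAL field, which is an assumption made here and not something stated in this thread. Under that assumption each "N" (ASCII 78) corresponds to quality 45, so a run of identical characters would simply mean that every base carries the same deletion or insertion quality. The function name `decode_phred33` is hypothetical.

```python
from typing import List


def decode_phred33(tag_value: str) -> List[int]:
    """Convert a Phred+33 encoded string into integer quality scores.

    Assumption: the tag uses the same ASCII offset (33) as the SAM QUAL field.
    """
    return [ord(ch) - 33 for ch in tag_value]


if __name__ == "__main__":
    # A made-up tag value of four identical characters, for illustration only.
    print(decode_phred33("NNNN"))  # prints [45, 45, 45, 45]
```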

Thanks!

Created 2012-09-17 23:38:11 | Updated 2012-09-17 23:38:11 | Tags: baserecalibrator

I've just run BaseRecalibrator on some whole genome sequences, and while scanning through the recalibration file, I noticed that some of the bases at the beginnings and ends of reads were getting very high recalibration values:

SxaQSEQsXAP010_lane_1             6  -99             Cycle          M                    7.5248        416048     73563
SxaQSEQsXAP010_lane_1             6  99              Cycle          M                    6.7402        271864     57587
SxaQSEQsXAP010_lane_1             6  -100            Cycle          M                   30.1585        519622       500
SxaQSEQsXAP010_lane_1             6  100             Cycle          M                   30.7455        408415       343
SxaQSEQsXAP010_lane_1             7  1               Cycle          M                   37.0476         55736        10
SxaQSEQsXAP010_lane_1             7  2               Cycle          M                    9.6561         55347      5990
...
SxaQSEQsXAP010_lane_1             7  -99             Cycle          M                    9.3230      14040721   1640938
SxaQSEQsXAP010_lane_1             7  99              Cycle          M                    9.0272      10199039   1275971
SxaQSEQsXAP010_lane_1             7  -100            Cycle          M                   33.1557      23210317     11222
SxaQSEQsXAP010_lane_1             7  100             Cycle          M                   33.9099      21072616      8564
SxaQSEQsXAP010_lane_1             8  -6              Cycle          M                    7.2585         42164      7926
...
SxaQSEQsXAP010_lane_1            21  -98             Cycle          M                   22.7383        839160      4466
SxaQSEQsXAP010_lane_1            21  98              Cycle          M                   22.5192        716787      4012
SxaQSEQsXAP010_lane_1            21  -99             Cycle          M                   39.9141        872572        88
SxaQSEQsXAP010_lane_1            21  99              Cycle          M                   40.9464        696355        55
SxaQSEQsXAP010_lane_1            21  -100            Cycle          M                   38.9586        999226       126
SxaQSEQsXAP010_lane_1            21  100             Cycle          M                   39.2492        799184        94
SxaQSEQsXAP010_lane_1            22  -1              Cycle          M                   37.2879         69618        12
SxaQSEQsXAP010_lane_1            22  1               Cycle          M                   36.5709        108966        23
SxaQSEQsXAP010_lane_1            22  -2              Cycle          M                   37.7221         35509         5
SxaQSEQsXAP010_lane_1            22  2               Cycle          M                   37.9585         99992        15
SxaQSEQsXAP010_lane_1            22  -3              Cycle          M                   21.2202         62377       470
SxaQSEQsXAP010_lane_1            22  3               Cycle          M                   23.3286        118578       550


A possible explanation is that the aligner (novoalign) is clipping any bases which mismatch, and so there are very few mismatches at the ends and beginnings of reads. That would mean that there are actually very few errors at the beginning and ends of reads, and empirically, the measured quality is high.

However, even if this is correct, I'm wondering if I should trust the recalibration: A base which was originally marked with a quality of 6 or 7 suddenly has the possibility of getting a big boost (modulo any other covariates).
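As a sanity check on the rows above, the empirical quality column appears to follow from the last two columns (observations, then errors) through a smoothed Phred formula, Q = -10 log10((errors + 1) / (observations + 2)). The smoothing constants and the function name below are inferred from the numbers in this table, not taken from the GATK documentation, so treat this as a sketch for checking individual rows.

```python
import math


def empirical_quality(observations: int, errors: int) -> float:
    """Phred-scaled error rate with +1/+2 smoothing (constants are an assumption)."""
    return -10.0 * math.log10((errors + 1) / (observations + 2))


if __name__ == "__main__":
    # (observations, errors, reported quality) copied from rows of the table above.
    rows = [
        (416048, 73563, 7.5248),
        (519622, 500, 30.1585),
        (55736, 10, 37.0476),
    ]
    for obs, err, reported in rows:
        computed = empirical_quality(obs, err)
        print(f"obs={obs:>8} errors={err:>6} reported={reported:.4f} computed={computed:.4f}")
```

Rows with very few errors, such as 10 in 55,736, are the ones where adding or removing a handful of mismatches changes the estimate the most.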

Do you have any thoughts, suggestions, or other possible explanations?

Thanks,

Kevin

Created 2012-09-06 08:52:07 | Updated 2012-09-06 14:31:48 | Tags: baserecalibrator

Hi, I am getting an error (unfortunately with no information about the cause) after running BaseRecalibrator:

java -Xmx4g -jar $tool/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I $bwa/BAM/s_1.rmdup_readgps.bam \
-R $bin/Bos_taurus.UMD3.1.66.fa \
-knownSites $bin/Bos_taurus_UMD_3.1.DBSNP.zero.ordered.bed \
-o $gatk/recal_rea/recal_data1.grp


I get output to screen of all chromosomes, positions etc followed by the error

chrX_dna:chromosome_chromosome:UMD3.1:X:1:148823899:1, chr1_dna:chromosome_chromosome:UMD3.1:1:1:158337067:1 #####ERROR------------------------------


Can you suggest any reasons for BaseRecalibrator giving up here? I understand the BED file is 0-based, but it was used successfully with the previous incarnation of BaseRecalibrator. I have tried -knownSites:mask,BED and it has no effect. I have all the necessary read group information, an index for the BAM, and an indexed BED.
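The error text above is truncated, so the cause is not known. One thing that could be ruled out, given that a list of contig names is printed just before the error, is a mismatch between the contig names in the BED file and those in the reference. The sketch below compares the first column of a BED file with the first column of a FASTA index (.fai); the function names and the sample data are hypothetical.

```python
from typing import List


def first_column_names(text: str) -> List[str]:
    """Collect unique first-column values in order, skipping blank and header lines."""
    names: List[str] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        name = line.split("\t")[0]
        if name not in names:
            names.append(name)
    return names


def contigs_missing_from_reference(bed_text: str, fai_text: str) -> List[str]:
    """Return BED contig names that do not occur in the reference index."""
    reference = set(first_column_names(fai_text))
    return [name for name in first_column_names(bed_text) if name not in reference]


if __name__ == "__main__":
    # Hypothetical inputs; real files would be read from disk.
    fai = "chr1\t158337067\t6\t60\t61\nchrX\t148823899\t160976031\t60\t61\n"
    bed = "1\t100\t101\nchrX\t200\t201\n"
    print(contigs_missing_from_reference(bed, fai))  # prints ['1']
```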

Created 2012-09-04 20:33:32 | Updated 2013-01-07 20:43:56 | Tags: baserecalibrator platform

Sorry to post such a simple question, but I seem to be at my wits' end. BaseRecalibrator keeps giving me this error:

##### ERROR ------------------------------------------------------------------------------------------

Why isn't HiSeq2000 recognized as Illumina?
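The error text is elided above, so the following is a guess at the situation. The SAM specification restricts the read group PL field to a fixed vocabulary of platform names (for example ILLUMINA, SOLID, LS454, PACBIO), and an instrument model such as HiSeq2000 is not one of them. A minimal sketch of checking an @RG header line against that vocabulary follows; the platform set below is a subset chosen as an assumption (consult the current specification for the full list), and the function names and read group values are hypothetical.

```python
from typing import Optional

# Assumed subset of the platform names allowed by the SAM specification.
SAM_PLATFORMS = {
    "CAPILLARY", "LS454", "ILLUMINA", "SOLID",
    "HELICOS", "IONTORRENT", "PACBIO",
}


def read_group_platform(rg_line: str) -> Optional[str]:
    """Return the PL value from a tab-separated @RG header line, or None if absent."""
    for field in rg_line.rstrip("\n").split("\t")[1:]:
        if field.startswith("PL:"):
            return field[3:]
    return None


def is_known_platform(rg_line: str) -> bool:
    """True if the read group's PL value is one of the assumed platform names."""
    platform = read_group_platform(rg_line)
    return platform is not None and platform.upper() in SAM_PLATFORMS


if __name__ == "__main__":
    for line in ("@RG\tID:lane1\tSM:sample1\tPL:HiSeq2000",
                 "@RG\tID:lane1\tSM:sample1\tPL:ILLUMINA"):
        print(read_group_platform(line), is_known_platform(line))
```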

Created 2012-08-13 18:49:39 | Updated 2012-10-03 19:30:14 | Tags: bqsr baserecalibrator

This method is described as the "First pass of the base quality score recalibration". What is the second pass? It is not mentioned anywhere, or am I looking in the wrong place? In v1.2 there were two steps; is there only one step now for BQSR? Confused, Juan
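Elsewhere in these discussions the BQSR pipeline is described as a BaseRecalibrator step followed by a PrintReads step. The sketch below builds the two invocations as argument lists, assuming PrintReads receives the recalibration table through `-BQSR`. The jar path, file names and the function name `bqsr_commands` are placeholders for illustration.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # placeholder path to the GATK jar


def bqsr_commands(bam: str, reference: str, known_sites: List[str],
                  table: str, out_bam: str) -> List[List[str]]:
    """Return argument lists for the two passes: build the table, then apply it."""
    base = ["java", "-Xmx4g", "-jar", GATK_JAR]

    # First pass: collect covariate statistics into a recalibration table.
    first = base + ["-T", "BaseRecalibrator", "-I", bam, "-R", reference]
    for sites in known_sites:
        first += ["-knownSites", sites]
    first += ["-o", table]

    # Second pass: write a new BAM with recalibrated qualities.
    second = base + ["-T", "PrintReads", "-I", bam, "-R", reference,
                     "-BQSR", table, "-o", out_bam]
    return [first, second]


if __name__ == "__main__":
    commands = bqsr_commands("my_reads.bam", "reference.fasta",
                             ["known_sites.vcf"], "recal_data.grp",
                             "my_reads.recal.bam")
    for argv in commands:
        print(" ".join(argv))
```

Each list can be handed to `subprocess.run(argv, check=True)` so that the second pass only starts if the first one succeeded.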

Created 2012-08-10 01:08:49 | Updated 2012-08-10 01:13:06 | Tags: bqsr baserecalibrator gatk2

I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator walker. According to this post, BaseRecalibrator should output "A PDF file containing quality control plots showing the patterns of recalibration of the data"; however, I do not have any such file. Both the BaseRecalibrator and PrintReads steps of the BQSR pipeline appear to have worked, as I have a recalibrated BAM file and the accompanying GATKReport, but I would like to be able to view plots of the recalibration process (preferably generated automatically by the recalibration pipeline).

Created 2012-08-09 13:19:19 | Updated 2012-08-09 13:19:19 | Tags: baserecalibrator knownsites

I am using the latest version of GATK. During quality score recalibration I got the following error. The command was as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R ~/SCZ_data/ref_hg19/hg19sum_upper.fa --DBSNP dbsnp132.txt -I ../output.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile input.recal_data.csv

##### ERROR MESSAGE: Could not find walker with name: CountCovariates

Later I understood that I should use BaseRecalibrator with this new version of GATK, but I am still not sure what to supply as the known SNP sites file for the -knownSites argument, or where to obtain these VCF files.

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I my_reads.bam \
-R resources/Homo_sapiens_assembly18.fasta \
-knownSites bundle/hg18/dbsnp_132.hg18.vcf \
-knownSites another/optional/setOfSitesToMask.vcf \
-o recal_data.grp

Can you please suggest what should be done?

Created 2012-08-03 11:49:07 | Updated 2012-08-03 11:52:04 | Tags: baserecalibrator

Hi,

We are working with Illumina HiSeq 2000 paired-end data and as time goes by, lanes yield more and more sequences.

We are processing data at the lane BAM level (only one read group). The procedure includes, among other steps, BWA mapping, indel realignment, duplicate flagging and base quality recalibration. This is, as expected, a long process to complete, but the base recalibration stage is clearly the longest by far, especially when lanes contain many sequences. We are using the QualityScoreCovariate, ReadGroupCovariate, ContextCovariate and CycleCovariate covariates.

For instance, we have some quite big lanes:

1 lane of 140,000,000 pairs (280,000,000 reads): ~36 hours for recalibration

1 lane of 185,000,000 pairs (370,000,000 reads): ~48 hours for recalibration

We obviously wish to reduce this run time, and I found a short section on the topic at the very end of the following page: http://gatk.vanillaforums.com/discussion/44/base-quality-score-recalibrator#latest

So we are keen on downsampling our BAM files to reduce run time, but at the same time we want our data to be as accurate as possible, for instance to help reduce the rate of false positive substitutions. So if it is worth waiting, we will wait.

Nevertheless, in the plot shown in the previous link, the x axis stops at 5,000,000 reads, where the RMSE value seems to have reached a "plateau".
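To put the numbers above side by side, the short sketch below computes the throughput implied by the two lanes and a naive linear extrapolation to a 5,000,000-read subset. The extrapolation assumes run time scales linearly with read count and ignores fixed start-up costs, so it is only an order-of-magnitude estimate; the function names are hypothetical.

```python
def reads_per_hour(reads: int, hours: float) -> float:
    """Average throughput for one recalibration run."""
    return reads / hours


def estimated_hours(reads: int, rate: float) -> float:
    """Naive linear extrapolation of run time at a given throughput."""
    return reads / rate


if __name__ == "__main__":
    # (reads, hours) as reported for the two lanes above.
    lanes = [(280_000_000, 36.0), (370_000_000, 48.0)]
    rates = [reads_per_hour(reads, hours) for reads, hours in lanes]
    for (reads, hours), rate in zip(lanes, rates):
        print(f"{reads:,} reads in {hours:.0f} h: {rate / 1e6:.2f} M reads/h")

    mean_rate = sum(rates) / len(rates)
    subset = 5_000_000  # where the x axis of the referenced plot stops
    print(f"{subset:,} reads at the mean rate: {estimated_hours(subset, mean_rate):.2f} h")
```

The two lanes give nearly the same throughput, which is consistent with run time growing roughly in proportion to the number of reads.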

1) We were thus wondering whether there is an empirical read count threshold above which the accuracy of the recalibration no longer improves?

2) If such a threshold exists: I cannot find the '--process_nth_locus' switch described in the link above. Should I use the '-dt', '-dfrac' or '-dcov' options to downsample instead?

3) Does '--num_threads' work with the BaseRecalibrator walker? If so, up to how many threads?

Thanks a lot,

Best Regards,

Anthony

PS: The GATK version used is v2.0-23-ge9a19be.

Created 2012-07-26 05:07:14 | Updated 2012-10-19 16:51:54 | Tags: indelrealigner baserecalibrator knownsites

Dear GATK team,

Thanks a lot for the new GATK version and GATK forum!

I am trying to use GATK for yeast strains. I do not have files of known SNP/indel sites. I understand that BaseRecalibrator requires such a file. Do you suggest skipping recalibration and realignment, or is there another way to go here?