BaseRecalibrator

First pass of the base quality score recalibration -- Generates recalibration table based on various user-specified covariates (such as read group, reported quality score, machine cycle, and nucleotide context).

Category Sequence Data Processing Tools

Traversal ReadWalker

PartitionBy READ


Overview

This walker is designed to work as the first pass in a two-pass processing step. It does a by-locus traversal operating only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality. This walker generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and context). Because there is a large amount of data, one can then calculate an empirical probability of error given the particular covariates seen at a site, where p(error) = num mismatches / num observations. The output file is a table whose columns are the covariate values, num observations, num mismatches, and the empirical quality score.
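
As a minimal sketch of the calculation described above (not the tool's own implementation, which may apply additional smoothing or caps), the raw empirical quality for one combination of covariates can be computed by phred-scaling the mismatch rate. The function name and the counts are hypothetical.

```python
import math


def empirical_quality(num_mismatches: int, num_observations: int) -> float:
    """Phred-scale the raw mismatch rate p(error) = mismatches / observations."""
    if num_observations <= 0:
        raise ValueError("need at least one observation")
    p_error = num_mismatches / num_observations
    if p_error == 0.0:
        # No mismatches observed: the raw estimate is unbounded.
        return float("inf")
    return -10.0 * math.log10(p_error)


# Hypothetical counts for one covariate combination:
# 10 mismatches in 10,000 observations is an error rate of 1 in 1,000.
print(empirical_quality(10, 10_000))
```

An error rate of 1 in 1,000 corresponds to a quality of about Q30, and 1 in 100 to about Q20.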

Note: ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added for the user regardless of whether or not they were specified.

Input

The input read data whose base quality scores need to be assessed.

A database of known polymorphic sites to skip over.

Output

A GATK Report file with many tables:

  1. The list of arguments
  2. The quantized qualities table
  3. The recalibration table by read group
  4. The recalibration table by quality score
  5. The recalibration table for all the optional covariates

The GATK Report is intended to be easy to read by humans or computers. Check out the documentation of the GATKReport to learn how to manipulate this table.

Examples

 java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -I my_reads.bam \
   -R resources/Homo_sapiens_assembly18.fasta \
   -knownSites bundle/hg18/dbsnp_132.hg18.vcf \
   -knownSites another/optional/setOfSitesToMask.vcf \
   -o recal_data.table
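
When the invocation is assembled by a script, the same command can be built as an argument list. The sketch below only constructs and prints the command, it does not run the tool. The helper name and its parameters are hypothetical; the flags are the ones used in the example above. Refusing an empty list of known sites is a choice made for this sketch.

```python
import shlex


def build_base_recalibrator_command(bam, reference, known_sites, out_table,
                                    jar="GenomeAnalysisTK.jar", heap="4g"):
    """Assemble the BaseRecalibrator command line as a list of arguments."""
    if not known_sites:
        # Sketch-level guard: the recalibration depends on skipping known sites.
        raise ValueError("at least one known-sites file is expected")
    cmd = ["java", f"-Xmx{heap}", "-jar", jar,
           "-T", "BaseRecalibrator",
           "-I", bam,
           "-R", reference]
    for sites in known_sites:
        # -knownSites may be given any number of times.
        cmd += ["-knownSites", sites]
    cmd += ["-o", out_table]
    return cmd


cmd = build_base_recalibrator_command(
    "my_reads.bam",
    "resources/Homo_sapiens_assembly18.fasta",
    ["bundle/hg18/dbsnp_132.hg18.vcf", "another/optional/setOfSitesToMask.vcf"],
    "recal_data.table",
)
print(" ".join(shlex.quote(part) for part in cmd))
```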
 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by BaseRecalibrator.

Parallelism options

This tool can be run in multi-threaded mode using this option.

Downsampling settings

This tool does not apply any downsampling by default.


Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

BaseRecalibrator specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s)                                 Default value    Summary

Required Outputs
--out, -o                                        NA               The output recalibration table file to create

Optional Inputs
--knownSites                                     []               A database of known polymorphic sites to skip over in the recalibration algorithm

Optional Parameters
--binary_tag_name, -bintag                       NA               the binary tag covariate name if using it
--covariate, -cov                                NA               One or more covariates to be used in the recalibration. Can be specified multiple times
--deletions_default_quality, -ddq                45               default quality for the base deletions covariate
--indels_context_size, -ics                      3                Size of the k-mer context to be used for base insertions and deletions
--insertions_default_quality, -idq               45               default quality for the base insertions covariate
--low_quality_tail, -lqt                         2                minimum quality for the bases in the tail of the reads to be considered
--maximum_cycle_value, -maxCycle                 500              The maximum cycle value permitted for the Cycle covariate
--mismatches_context_size, -mcs                  2                Size of the k-mer context to be used for base mismatches
--mismatches_default_quality, -mdq               -1               default quality for the base mismatches covariate
--quantizing_levels, -ql                         16               number of distinct quality scores in the quantized output
--solid_nocall_strategy                          THROW_EXCEPTION  Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
--solid_recal_mode, -sMode                       SET_Q_ZERO       How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS

Optional Flags
--list, -ls                                      false            List the available covariates and exit
--lowMemoryMode                                  false            Reduce memory usage in multi-threaded code at the expense of threading efficiency
--no_standard_covs, -noStandard                  false            Do not use the standard set of covariates, but rather just the ones listed using the -cov argument
--sort_by_all_columns, -sortAllCols              false            Sort the rows in the tables of reports

Advanced Parameters
--bqsrBAQGapOpenPenalty, -bqsrBAQGOP             40.0             BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets

Advanced Flags
--run_without_dbsnp_potentially_ruining_quality  false            If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--binary_tag_name / -bintag

the binary tag covariate name if using it
The tag name for the binary tag covariate (if using it)

String


--bqsrBAQGapOpenPenalty / -bqsrBAQGOP

BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets

double  40.0  [ [ -∞  ∞ ] ]


--covariate / -cov

One or more covariates to be used in the recalibration. Can be specified multiple times
Note that the ReadGroup and QualityScore covariates are required and do not need to be specified. Also, unless --no_standard_covs is specified, the Cycle and Context covariates are standard and are included by default. Use the --list argument to see the available covariates.

String[]
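
The rules above for which covariates end up in use can be sketched as follows. This is an illustration of the documented behavior, not the tool's code: the function name is hypothetical, and the class-style names CycleCovariate and ContextCovariate are assumed by analogy with ReadGroupCovariate and QualityScoreCovariate.

```python
# Required covariates are always added, whether or not the user lists them.
REQUIRED = ["ReadGroupCovariate", "QualityScoreCovariate"]
# Standard covariates are included unless --no_standard_covs is given.
STANDARD = ["CycleCovariate", "ContextCovariate"]


def resolve_covariates(requested=(), no_standard_covs=False):
    """Return the covariates in effect for a given -cov list and flag setting."""
    resolved = list(REQUIRED)
    if not no_standard_covs:
        resolved += STANDARD
    for name in requested:
        # Skip anything already present so each covariate appears once.
        if name not in resolved:
            resolved.append(name)
    return resolved


print(resolve_covariates())
print(resolve_covariates(["ContextCovariate"], no_standard_covs=True))
```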


--deletions_default_quality / -ddq

default quality for the base deletions covariate
A default base quality to use as a prior (reported quality) in the deletion covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is on]

byte  45  [ [ -∞  ∞ ] ]


--indels_context_size / -ics

Size of the k-mer context to be used for base insertions and deletions
The context covariate will use a context of this size to calculate its covariate value for base insertions and deletions. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.

int  3  [ [ -∞  ∞ ] ]


--insertions_default_quality / -idq

default quality for the base insertions covariate
A default base quality to use as a prior (reported quality) in the insertion covariate model. This parameter is used for all reads that lack per-base insertion quality scores. [default is on]

byte  45  [ [ -∞  ∞ ] ]


--knownSites / -knownSites

A database of known polymorphic sites to skip over in the recalibration algorithm
This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of RodBindings (VCF, Bed, etc.) for use as this database. Users wishing to exclude an interval list of known variation can simply use -XL my.interval.list to skip over processing those sites. Please note, however, that the statistics reported by the tool will not accurately reflect the sites skipped by the -XL argument.

--knownSites binds reference ordered data. This argument supports ROD files of the following types: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, GELITEXT, OLDDBSNP, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3

List[RodBinding[Feature]]  []


--list / -ls

List the available covariates and exit
Note that the --list argument requires a fully resolved and correct command-line to work.

boolean  false


--low_quality_tail / -lqt

minimum quality for the bases in the tail of the reads to be considered
Reads with low quality bases on either tail (beginning or end) will not be considered in the context. This parameter defines the quality at or below which a tail is considered low quality.

byte  2  [ [ -∞  ∞ ] ]
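
One way to picture the tail rule is as trimming from both ends of the read until a base above the threshold is reached. The sketch below is an assumption about how that could be expressed, not the tool's code; the function name and the example qualities are hypothetical.

```python
def high_quality_span(quals, low_quality_tail=2):
    """Return the half-open (start, end) index range left after dropping
    leading and trailing bases whose quality is <= low_quality_tail."""
    start, end = 0, len(quals)
    while start < end and quals[start] <= low_quality_tail:
        start += 1
    while end > start and quals[end - 1] <= low_quality_tail:
        end -= 1
    return start, end


# Hypothetical base qualities: two low bases at the start, one at the end.
quals = [2, 2, 30, 31, 2, 35, 2]
start, end = high_quality_span(quals)
print(quals[start:end])
```

Only the tails are affected in this sketch: a low quality base in the interior of the read stays inside the returned span.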


--lowMemoryMode / -lowMemoryMode

Reduce memory usage in multi-threaded code at the expense of threading efficiency
When you have nct > 1, BQSR uses nct times more memory to compute its recalibration tables, for efficiency purposes. If you have many covariates, and are therefore using a lot of memory, you can use this flag to safely access only one table. There may be some CPU cost, but as long as the table is really big the cost should be relatively small.

boolean  false


--maximum_cycle_value / -maxCycle

The maximum cycle value permitted for the Cycle covariate
The cycle covariate will generate an error if it encounters a cycle greater than this value. This argument is ignored if the Cycle covariate is not used.

int  500  [ [ -∞  ∞ ] ]


--mismatches_context_size / -mcs

Size of the k-mer context to be used for base mismatches
The context covariate will use a context of this size to calculate its covariate value for base mismatches. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.

int  2  [ [ -∞  ∞ ] ]
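
To see why larger values of --mismatches_context_size and --indels_context_size cost more runtime and heap, note that a k-mer over the four nucleotides A, C, G and T can take 4^k distinct values. The sketch below assumes that four-letter alphabet and simply counts the possible contexts; the function name is hypothetical.

```python
def context_count(context_size):
    """Number of distinct k-mer contexts over A, C, G, T for a given size."""
    if not 1 <= context_size <= 13:
        raise ValueError("context size must be between 1 and 13 (inclusive)")
    return 4 ** context_size


# Defaults from the argument table, plus the documented maximum.
for k in (2, 3, 13):
    print(k, context_count(k))
```

The default sizes of 2 and 3 give 16 and 64 possible contexts, while the maximum of 13 gives more than 67 million.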


--mismatches_default_quality / -mdq

default quality for the base mismatches covariate
A default base quality to use as a prior (reported quality) in the mismatch covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is off]

byte  -1  [ [ -∞  ∞ ] ]


--no_standard_covs / -noStandard

Do not use the standard set of covariates, but rather just the ones listed using the -cov argument

boolean  false


--out / -o

The output recalibration table file to create
After the header, data records occur one per line until the end of the file. The first several items on a line are the values of the individual covariates and will change depending on which covariates were specified at runtime. The last three items are the data: the number of observations for this combination of covariates, the number of reference mismatches, and the raw empirical quality score calculated by phred-scaling the mismatch rate. Use '/dev/stdout' to print to standard out.

R File


--quantizing_levels / -ql

number of distinct quality scores in the quantized output
BQSR generates a quantization table for quick quantization later by subsequent tools. BQSR does not quantize the base qualities itself; this is done by the engine with the -qq or -BQSR options. This parameter tells BQSR the number of levels of quantization to use to build the quantization table.

int  16  [ [ -∞  ∞ ] ]
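
To illustrate what a quantization table with a fixed number of levels looks like, the sketch below maps every quality score to a representative value using plain equal-width binning. This is a deliberately simple stand-in and not the method BQSR uses to choose its levels; the function name and the maximum quality of 93 are assumptions made for the example.

```python
def build_quantization_table(levels=16, max_quality=93):
    """Map each quality 0..max_quality to the lowest quality in its bin,
    using equal-width bins (a simplification chosen for illustration)."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    size = max_quality + 1
    table = []
    for q in range(size):
        bin_index = q * levels // size
        # Smallest quality that falls in this bin: ceil(bin_index * size / levels).
        bin_start = -(-bin_index * size // levels)
        table.append(bin_start)
    return table


table = build_quantization_table(levels=16)
print(sorted(set(table)))  # the distinct quantized quality values
print(table[30])           # quantized value for a reported Q30
```

With 16 levels the table contains 16 distinct output values, matching the idea of "number of distinct quality scores in the quantized output".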


--run_without_dbsnp_potentially_ruining_quality / -run_without_dbsnp_potentially_ruining_quality

If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.
This calculation is critically dependent on being able to skip over known polymorphic sites. Please be sure that you know what you are doing if you use this option.

boolean  false


--solid_nocall_strategy / -solid_nocall_strategy

Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
BaseRecalibrator accepts a --solid_nocall_strategy flag which governs how the recalibrator handles no calls in the color space tag. Unfortunately, because of the reference inserted bases (see --solid_recal_mode), reads with no calls in their color space tag cannot be recalibrated.

The --solid_nocall_strategy argument is an enumerated type (SOLID_NOCALL_STRATEGY), which can have one of the following values:

THROW_EXCEPTION
When a no call is detected throw an exception to alert the user that recalibrating this SOLiD data is unsafe. This is the default option.
LEAVE_READ_UNRECALIBRATED
Leave the read in the output bam completely untouched. This mode is only okay if the no calls are very rare.
PURGE_READ
Mark these reads as failing vendor quality checks so they can be filtered out by downstream analyses.

SOLID_NOCALL_STRATEGY  THROW_EXCEPTION


--solid_recal_mode / -sMode

How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS
BaseRecalibrator accepts a --solid_recal_mode flag which governs how the recalibrator handles the reads which have had the reference inserted because of color space inconsistencies.

The --solid_recal_mode argument is an enumerated type (SOLID_RECAL_MODE), which can have one of the following values:

DO_NOTHING
Treat reference inserted bases as reference matching bases. Very unsafe!
SET_Q_ZERO
Set reference inserted bases and the previous base (because of color space alignment details) to Q0. This is the default option.
SET_Q_ZERO_BASE_N
In addition to setting the quality scores to zero, also set the base itself to 'N'. This is useful to visualize in IGV.
REMOVE_REF_BIAS
Look at the color quality scores and probabilistically decide to change the reference inserted base to be the base which is implied by the original color space instead of the reference.

SOLID_RECAL_MODE  SET_Q_ZERO


--sort_by_all_columns / -sortAllCols

Sort the rows in the tables of reports

Boolean  false



GATK version 3.2-2-gec30cee built at 2014/07/17 17:54:48. GTD: NA