First pass of the base quality score recalibration -- Generates recalibration table based on various user-specified covariates (such as read group, reported quality score, machine cycle, and nucleotide context).
This walker is designed to work as the first pass in a two-pass processing step. It does a by-locus traversal operating only at sites that are not in dbSNP. We therefore assume that all reference mismatches we see are errors and indicative of poor base quality. This walker generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and context). Because there is a large amount of data, one can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table of the covariate values, num observations, num mismatches, and empirical quality score.
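As a rough illustration of the calculation described above, the following Python sketch phred-scales a raw mismatch rate. The function name `empirical_quality` and the `max_quality` cap are assumptions made for this example; the tool itself may smooth or cap the value differently.

```python
import math

def empirical_quality(num_observations, num_mismatches, max_quality=93):
    """Phred-scale the raw mismatch rate: Q = -10 * log10(mismatches / observations)."""
    if num_observations <= 0:
        raise ValueError("need at least one observation")
    if num_mismatches == 0:
        # No observed errors: the rate is 0, so cap the score (the cap value is an assumption).
        return float(max_quality)
    p_error = num_mismatches / num_observations
    return min(-10.0 * math.log10(p_error), float(max_quality))

# 1,000 mismatches in 1,000,000 observations is an error rate of 1e-3.
print(round(empirical_quality(1_000_000, 1_000), 2))
```

An error rate of 1 in 1,000 corresponds to a quality of about 30, and 1 in 100 to about 20.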
Note: ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added for the user regardless of whether or not they were specified.
The input read data whose base quality scores need to be assessed.
A database of known polymorphic sites to skip over.
A GATK Report file with many tables:
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -I my_reads.bam \
    -R resources/Homo_sapiens_assembly18.fasta \
    -knownSites bundle/hg18/dbsnp_132.hg18.vcf \
    -knownSites another/optional/setOfSitesToMask.vcf \
    -o recal_data.table
These Read Filters are automatically applied to the data by the Engine before processing by BaseRecalibrator.
This tool can be run in multi-threaded mode using this option.
This tool does not apply any downsampling by default.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table.
|Argument name(s)||Default value||Summary|
||NA||The output recalibration table file to create|
||||A database of known polymorphic sites to skip over in the recalibration algorithm|
||NA||the binary tag covariate name if using it|
||NA||One or more covariates to be used in the recalibration. Can be specified multiple times|
||45||default quality for the base deletions covariate|
||3||Size of the k-mer context to be used for base insertions and deletions|
||45||default quality for the base insertions covariate|
||2||minimum quality for the bases in the tail of the reads to be considered|
||500||The maximum cycle value permitted for the Cycle covariate|
||2||Size of the k-mer context to be used for base mismatches|
||-1||default quality for the base mismatches covariate|
||16||number of distinct quality scores in the quantized output|
||THROW_EXCEPTION||Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ|
||SET_Q_ZERO||How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS|
||false||List the available covariates and exit|
||false||Reduce memory usage in multi-threaded code at the expense of threading efficiency|
||false||Do not use the standard set of covariates, but rather just the ones listed using the -cov argument|
||false||Sort the rows in the tables of reports|
||40.0||BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets|
||false||If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.|
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
the binary tag covariate name if using it
The tag name for the binary tag covariate (if using it)
BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets
double 40.0 [ [ -∞ ∞ ] ]
One or more covariates to be used in the recalibration. Can be specified multiple times
Note that the ReadGroup and QualityScore covariates are required and do not need to be specified. Also, unless --no_standard_covs is specified, the Cycle and Context covariates are standard and are included by default. Use the --list argument to see the available covariates.
default quality for the base deletions covariate
A default base quality to use as a prior (reported quality) in the deletion covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is on]
byte 45 [ [ -∞ ∞ ] ]
Size of the k-mer context to be used for base insertions and deletions
The context covariate will use a context of this size to calculate its covariate value for base insertions and deletions. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and the required Java heap size.
int 3 [ [ -∞ ∞ ] ]
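To give a feel for what a k-mer context is and why larger sizes cost more memory, here is a minimal Python sketch. It assumes the context of a base is the k bases ending at that base, and that contexts containing an ambiguous base such as N are skipped; both choices are assumptions for illustration and may not match how the tool encodes contexts internally. The function names are hypothetical.

```python
def kmer_contexts(bases, k):
    """Return, for each position, the k bases ending at that position (None if unavailable)."""
    contexts = []
    for i in range(len(bases)):
        start = i - k + 1
        kmer = bases[start:i + 1] if start >= 0 else None
        # Assumption: contexts containing anything other than A, C, G, T are skipped.
        if kmer is not None and any(b not in "ACGT" for b in kmer):
            kmer = None
        contexts.append(kmer)
    return contexts

def num_possible_contexts(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k

print(kmer_contexts("ACGTN", 2))   # [None, 'AC', 'CG', 'GT', None]
print(num_possible_contexts(3))    # 64
print(num_possible_contexts(13))   # 67108864
```

The number of distinct contexts grows as 4^k, which is consistent with the note that higher values increase runtime and heap requirements.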
default quality for the base insertions covariate
A default base quality to use as a prior (reported quality) in the insertion covariate model. This value is used for all reads that lack per-base insertion quality scores. [default is on]
byte 45 [ [ -∞ ∞ ] ]
A database of known polymorphic sites to skip over in the recalibration algorithm
This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of RodBindings (VCF, Bed, etc.) for use as this database. Users wishing to exclude an interval list of known variation can simply use -XL my.interval.list to skip over processing those sites. Please note, however, that the statistics reported by the tool will not accurately reflect the sites skipped by the -XL argument.
--knownSites binds reference ordered data. This argument supports ROD files of the following types: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, GELITEXT, OLDDBSNP, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3
List the available covariates and exit
Note that the --list argument requires a fully resolved and correct command-line to work.
minimum quality for the bases in the tail of the reads to be considered
Reads with low-quality bases on either tail (beginning or end) will not be considered in the context. This parameter defines the quality at or below which a tail is considered low quality.
byte 2 [ [ -∞ ∞ ] ]
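The inclusive threshold can be sketched in Python as follows. This example assumes that only the run of low-quality bases at each end of a read is excluded, and that low-quality bases in the interior are kept; the function name `usable_span` is hypothetical and the tool's actual handling may differ.

```python
def usable_span(quals, low_quality_tail=2):
    """Return (start, end) half-open indices left after trimming both low-quality tails."""
    start, end = 0, len(quals)
    # Trim the leading tail: qualities at or below the threshold count as low quality.
    while start < end and quals[start] <= low_quality_tail:
        start += 1
    # Trim the trailing tail with the same inclusive comparison.
    while end > start and quals[end - 1] <= low_quality_tail:
        end -= 1
    return start, end

quals = [2, 2, 30, 31, 2, 29, 2]
print(usable_span(quals))  # (2, 6)
```

Note that the quality-2 base at index 4 is retained because it is not part of either tail.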
Reduce memory usage in multi-threaded code at the expense of threading efficiency
When nct > 1, BQSR uses nct times more memory to compute its recalibration tables, for efficiency purposes. If you have many covariates, and are therefore using a lot of memory, you can use this flag to have all threads safely access a single table. There may be some CPU cost, but as long as the table is really big the cost should be relatively small.
The maximum cycle value permitted for the Cycle covariate
The cycle covariate will generate an error if it encounters a cycle greater than this value. This argument is ignored if the Cycle covariate is not used.
int 500 [ [ -∞ ∞ ] ]
Size of the k-mer context to be used for base mismatches
The context covariate will use a context of this size to calculate its covariate value for base mismatches. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and the required Java heap size.
int 2 [ [ -∞ ∞ ] ]
default quality for the base mismatches covariate
A default base quality to use as a prior (reported quality) in the mismatch covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is off]
byte -1 [ [ -∞ ∞ ] ]
Do not use the standard set of covariates, but rather just the ones listed using the -cov argument
The output recalibration table file to create
After the header, data records occur one per line until the end of the file. The first several items on a line are the values of the individual covariates and will change depending on which covariates were specified at runtime. The last three items are the data: the number of observations for this combination of covariates, the number of reference mismatches, and the raw empirical quality score calculated by phred-scaling the mismatch rate. Use '/dev/stdout' to print to standard out.
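The record layout described above (covariate values first, then three data fields) could be read with a sketch like the following. It assumes whitespace-delimited fields and uses a made-up sample line; the function name `parse_record` and the column names are hypothetical, and a real GATK Report table has its own header and column layout that this sketch does not handle.

```python
def parse_record(line, covariate_names):
    """Split one whitespace-delimited record into covariate values and the three data fields."""
    fields = line.split()
    if len(fields) != len(covariate_names) + 3:
        raise ValueError(f"expected {len(covariate_names) + 3} fields, got {len(fields)}")
    # Everything except the last three fields is a covariate value.
    covariates = dict(zip(covariate_names, fields[:-3]))
    observations, mismatches, empirical_quality = fields[-3:]
    return {
        "covariates": covariates,
        "observations": int(observations),
        "mismatches": int(mismatches),
        "empirical_quality": float(empirical_quality),
    }

# Made-up record for illustration only.
record = parse_record("rg1 30 12 AC 1000000 1000 30.0",
                      ["ReadGroup", "QualityScore", "Cycle", "Context"])
print(record["mismatches"] / record["observations"])  # 0.001
```

Because the number of covariate columns depends on the run, the sketch indexes the data fields from the end of the line.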
number of distinct quality scores in the quantized output
BQSR generates a quantization table for quick quantization later by subsequent tools. BQSR does not quantize the base qualities itself; this is done by the engine with the -qq or -BQSR options. This parameter tells BQSR the number of levels of quantization to use to build the quantization table.
int 16 [ [ -∞ ∞ ] ]
If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.
This calculation is critically dependent on being able to skip over known polymorphic sites. Please be sure that you know what you are doing if you use this option.
Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
The --solid_nocall_strategy argument is an enumerated type (SOLID_NOCALL_STRATEGY), which can have one of the following values: THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ.
How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS
The --solid_recal_mode argument is an enumerated type (SOLID_RECAL_MODE), which can have one of the following values: DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS.
Sort the rows in the tables of reports
GATK version 3.1-1-g07a4bf8 built at 2014/03/18 07:00:36. GTD: NA