# Tagged with #bqsr 3 documentation articles | 1 announcement | 37 forum discussions

When you run AnalyzeCovariates to analyze your BQSR outputs, you may encounter an error starting with this line:

org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.


The main reason why this error often occurs is simple, and so is the solution. The script depends on some external R libraries, so if you don't have them installed, the script fails. To find out what libraries are necessary and how to install them, you can refer to this tutorial.

One other common issue is that the version of ggplot2 you have installed is very recent and is not compatible with the BQSR script. If so, download this Rscript file and use it to generate the plots manually according to the instructions below.

If you have already checked that you have all the necessary libraries installed, you'll need to run the script manually in order to find out what is wrong. To new users, this can seem complicated, but it only takes these 3 simple steps to do it!

### 1. Re-run AnalyzeCovariates with these additional parameters:

• -l DEBUG (that's a lowercase L, not an uppercase i, to be clear) and
• -csv my-report.csv (where you can call the .csv file anything; this is so the intermediate csv file will be saved).

### 2. Identify the lines in the log output that says what parameters the RScript is given.

The snippet below shows you the components of the R script command line that AnalyzeCovariates uses.

INFO  18:04:55,355 AnalyzeCovariates - Generating plots file 'RTest.pdf'
DEBUG 18:04:55,672 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/utils/recalibration/BQSR.R /Users/schandra/BQSR_Testing/RTest.csv /Users/schandra/BQSR_Testing/RTest.recal /Users/schandra/BQSR_Testing/RTest.pdf
DEBUG 18:04:55,687 RScriptExecutor - Executing:
DEBUG 18:04:55,688 RScriptExecutor -   Rscript
DEBUG 18:04:55,688 RScriptExecutor -   -e
DEBUG 18:04:55,688 RScriptExecutor -   tempLibDir = '/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/Rlib.2085451458391709180';source('/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/BQSR.761775214345441497.R');
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.csv
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.recal
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.pdf


So, your full command line will be:

RScript BQSR.R RTest.csv RTest.recal RTest.pdf


• BQSR.R is the name of the script you want to run. It can be found here
• RTest.csv is the name of the original csv file output from AnalyzeCovariates.
• RTest.recal is your original recalibration file.
• RTest.pdf is the output pdf file; you can name it whatever you want.

### 3. Run the script manually with the above arguments.

For new users, the easiest way to do this is to do it from within an IDE program like RStudio. Or, you can start up R at the command line and run it that way, whatever you are comfortable with.

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Now they are widely used to represent probabilities and confidence scores in other contexts of genome science.

### Phred scale in context

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

### Phred scale in practice

In today’s sequencing output, by convention, Phred-scaled base quality scores range from 2 to 63. However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A higher score indicates a higher probability that a particular decision is correct, while conversely, a lower score indicates a higher probability that the decision is incorrect.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$Q = -10 \log E$$

So we can interpret this score as an estimate of error, where the error is e.g. the probability that the base is called incorrectly by the sequencer, but we can also interpret it as an estimate of accuracy, where the accuracy is e.g. the probability that the base was identified correctly by the sequencer. Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$E = 10 ^{-\left(\frac{Q}{10}\right)}$$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

$$\begin{eqnarray} A &=& 1 - E \nonumber \ &=& 1 - 10 ^{-\left(\frac{Q}{10}\right)} \nonumber \ \end{eqnarray}$$

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

Phred Quality Score Error Accuracy (1 - Error)
10 1/10 = 10% 90%
20 1/100 = 1% 99%
30 1/1000 = 0.1% 99.9%
40 1/10000 = 0.01% 99.99%
50 1/100000 = 0.001% 99.999%
60 1/1000000 = 0.0001% 99.9999%

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.

Detailed information about command line options for BaseRecalibrator can be found here.

## Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.

The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.

## Running the tools

### BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

• assigned quality score
• machine cycle producing this base
• current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.

### Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:


java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-I input.bam \
-BQSR recalibration_report.grp \
-o output.bam


After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.

Effectively the new quality score is:

• the sum of the global difference between reported quality scores and the empirical quality
• plus the quality bin specific shift
• plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

### Miscellaneous information

• The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
• A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
• Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real snps, you should always perform recalibration on your bam file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
• The recalibrator applies a "yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems, which has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.

## Example pre and post recalibration results

• Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
• There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

## The output of the BaseRecalibrator

• A Recalibration report containing all the recalibration information for the data

Note that the BasRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

### The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

• Arguments Table -- a table with all the arguments and its values
• Quantization Table
• Quality Score Table
• Covariates Table

#### Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:


#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...


#### Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.

Example Arguments table:


#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...


This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.


#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762


#### Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.


#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...


#### Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.


#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...


## Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

• First do an initial round of SNP calling on your original, unrecalibrated data.
• Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
• Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

### Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

### Consequences/ Solution:

Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.

This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!

I used the following script:

java -Xmx5g -jar $gatk -T BaseRecalibrator -R$ref_dir/ucsc.hg19.fasta -I bam_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.dedupped.realigned.bam -knownSitesref_dir/dbsnp_138.hg19.vcf -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites$ref_dir/1000G_phase1.indels.hg19.sites.vcf -o $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recal.grp java -Xmx5g -jar$gatk -T BaseRecalibrator -R $ref_dir/ucsc.hg19.fasta -I$bam_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.dedupped.realigned.bam -BQSR $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recal.grp -o$other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.post.recal.grp -knownSites $ref_dir/dbsnp_138.hg19.vcf -knownSites$ref_dir/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites $ref_dir/1000G_phase1.indels.hg19.sites.vcf java -Xmx5g -jar$gatk -T AnalyzeCovariates -R $ref_dir/ucsc.hg19.fasta -before$other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recal.grp -after $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.post.recal.grp -plots$other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recalibration_plots.pdf

##### ERROR MESSAGE: java.io.IOException: Input/output error

Hello,

I am working on non-human data and want to use GATK's Haplotype caller to call variants. I am using the protocol that is listed here.

So far I have reached the indel realignment stage and want to do base recalibration next. Since I do not have known sites, this page in the GATK forum advises you to do an initial round of SNP calling on original un-recalibrated data.

Has anybody experience doing this? Is this initial calling done with UnifiedGenotyper on the indel realigned BAMs, or since I want to use Haploytype Caller downstream, should I use this module instead?

Cheers, Mika

If I am not having information about known variants. Is it fine if I skip the BQSR step after indelRealignment step??

Hi, Here I have no problem :)

I ran a bqsr workflow on 2 set of data that were almost similare (same study, 2 indiv.), on 2 different machines.

Here are my observations : set1 : Dell powerEdge R820, with 128Go Ram, printReads ran in 7h hours with ~35Go ram used. set2 : Dell powerEdge R730 with 193Go ram (and many jobs already running) printReads ran in 1h20 min with ~35Go ram used.

Both sets were launched with the exact same command : -nct 20 set2 output is a little bigger (16Go vs 14Go for set1).

I'd like to know if this huge difference in duration is mostly due to materials, and if it is, which part ? CPU speed, cache, ram access ?

Thanks

Hello,

I am performing BQSR on RNA-seq data for the purpose of SNP calling. I was wondering about some issues:

1. My organism does not have a known set of SNPs. I asked this question before in the forum and accordingly, I am using a set of SNPs filtered as the input of knownsites for BQSR. In filtering, some SNPs have a tag 'SNPcluster', shall I use this file or shall I somehow filter the file so only the PASS SNPs are retained?

2. I have performed SNP calling only on a subset of my bam file because I was interested in certain chromosomes. Would that be fine if I only use this subset of SNPs and try to recalibrate only the subset of bam file not the entire bam? I am asking this question in case this can somehow create a bias for final results.

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Hi,

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option)

Thanks, Alva

Hi,

I encounter the "RScript exited with 1. Run with -l DEBUG for more info" error when trying to make recalibration plots using Rscript. However, I do have all the required packages installed on my computer, including gsalib, which seems to be the problem. When I run with -l DEBUG, I get the following information:

Error in library(gsalib) : there is no package called ‘gsalib’

The thing is that I am using my computer but working in a cluster, so not directly on my computer. Should I have the packages installed in the cluster or is it fine that I've installed the packages on my computer? I'm a bit new to this, so I apologize if this question is a bit basic.

Thanks, Alva

This discussion was created from comments split from: Base Quality Score Recalibration (BQSR). To ask new questions, please post on the article linked above.

hi, What does the covariate "ContextCovariate" in BQSR mean? Is it "reference sequence context" or "read context"? Is it the same with "Dinuc" in older versions of gatk? Thanks a lot!

##### ERROR ------------------------------------------------------------------------------------------

Do you know what's happening? Could you help fix this?

Many thanks!

Fabrice

Hi there, I have been using GATK to identify variants recently. I saw that BQSR is highly recommended. But I don’t know whether it is still needed for de novo mutation calling. For example, I want to identify de novo mutations generated in the progenies by single seed descent METHODS in plants. For example, in the paper of “The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana”, these spontaneous arising mutations may not included in the known sites of variants. Based on documentation posted in GATK websites, they assume that all reference mismatches we see are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed in the step of variant calling. So in this situation, what should I do? Or should I skip the BQSR step? Also what should I do when I reach to step- VQSR? Hope some GATK developers can help me on this. Thanks.

I am recalibrating the base qualities, using all the standard covariates, including machine cycle. The reads come from a 454 machine, and they have an average length of around 300 bases. Curiously, the recalibrator only issues an error if I set the --maximum_cycle_value to less than 27. And even if I set the --maximum_cycle_value to the true maximum length (1601), the RecalTable2 does not record cycles longer than 27. In DEBUG level, I get a few messages that seem to be related, but that I don't know how to interpret:

DEBUG 11:45:23,986 ReadCovariates - Keys cache miss for length 39 cache size 1 DEBUG 11:45:23,988 ReadCovariates - Keys cache miss for length 38 cache size 2 DEBUG 11:45:23,995 ReadCovariates - Keys cache miss for length 41 cache size 3 DEBUG 11:45:23,999 ReadCovariates - Keys cache miss for length 36 cache size 4 DEBUG 11:45:24,001 ReadCovariates - Keys cache miss for length 31 cache size 5 DEBUG 11:45:24,004 ReadCovariates - Keys cache miss for length 43 cache size 6 DEBUG 11:45:24,010 ReadCovariates - Keys cache miss for length 26 cache size 7 DEBUG 11:45:24,015 ReadCovariates - Keys cache miss for length 40 cache size 8 DEBUG 11:45:24,026 ReadCovariates - Keys cache miss for length 30 cache size 9 DEBUG 11:45:24,037 ReadCovariates - Keys cache miss for length 37 cache size 10 DEBUG 11:45:24,079 ReadCovariates - Keys cache miss for length 33 cache size 11 DEBUG 11:45:24,086 ReadCovariates - Keys cache miss for length 44 cache size 12 DEBUG 11:45:24,152 ReadCovariates - Keys cache miss for length 35 cache size 13 DEBUG 11:45:24,606 ReadCovariates - Keys cache miss for length 32 cache size 14 DEBUG 11:45:24,654 ReadCovariates - Keys cache miss for length 28 cache size 15 DEBUG 11:45:24,683 ReadCovariates - Keys cache miss for length 34 cache size 16 DEBUG 11:45:24,716 ReadCovariates - Keys cache miss for length 45 cache size 17 DEBUG 11:45:24,963 ReadCovariates - Keys cache miss for length 27 cache size 18 DEBUG 11:45:25,033 ReadCovariates - Keys cache miss for length 29 cache size 19 DEBUG 11:45:27,401 ReadCovariates - Keys cache miss for length 25 cache size 20 DEBUG 11:45:28,445 ReadCovariates - Keys cache miss for length 46 cache size 21 DEBUG 11:45:29,426 ReadCovariates - Keys cache miss for length 23 cache size 22 DEBUG 11:45:31,228 ReadCovariates - Keys cache miss for length 47 cache size 23

The fact is that after recalibration the low-qualitity tails get higher qualities, and I suspect that the recalibration does not work well for late cycles. I'm using GATK v3.2-2-gec30cee. This is what my command looks like:

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I realigned.bam -R ref.fa -o recal_data.table --maximum_cycle_value 1601 --run_without_dbsnp_potentially_ruining_quality -l DEBUG

One particularity of the bam file is that it's not huge. There are only about 500000 reads. Recalibration may not be accurate, but still, I wonder if it is using all the information available. There are hundreds of thousands of reads longer than 100 bases. Can anybody explain what's happening? Thanks.

Hi! I’m trying to plan my GATK pipeline and have a few questions. We currently have an assembled genome for a non-model avian species and would like to align the reads from the single individual to the genome and subsequently mine SNPs. Our end game is to design a Fluidigm SNP chip. Since we are only working with the data from a single individual, I am trying to figure out how to best minimize false positive SNP calls.

1) Prior to beginning mapping with BWA, is any pre-processing of the reads recommended? The tutorials say to use raw reads… do we not even need to remove adaptors?

2) Does GATK have any specific recommendations for non-model BQSR (i.e., when no reference set of high-quality SNPs is available)? I’ve found a couple references to this process (http://gatkforums.broadinstitute.org/discussion/3286/quality-score-recalibration-for-non-model-organisms, https://gist.github.com/brantfaircloth/4315737). One user indicated they were using SNPs with quality scores greater than 30 as their reference SNPs for BQSR, another 90. Does GATK have any other thoughts, or is this largely a judgment call?

Dear Geraldine and GATK experts,

I have attended the great Brussels workshop, and I am posting here the BQSR question I had.

I have a human exome experiment with 24-multiplexed samples per lane (Nextera libraries) where we first only did one lane of sequencing (~15x) and then added a second lane (summing up to ~30x). From what I understood reading the Best Practices, I probably don't have enough data to run a BQSR on each sample. How should I then do the BQSR step? Should I skip it altogether? Could I estimate the model parameters on one whole lane (all the samples) and then apply it separately to each sample?

And as a separate question: If I could turn back the clock, would it have been better to do 12-multiplexed samples per lane and run these two lanes of sequencing (24 samples in total) for the same amount of reads but giving me more data to do a BQSR step per sample?

Thanks a lot for your help!

I know the best practices document is on its way, but I recently came across this quite new page on how to use the -L argument as was surprised to see that:

Small Targeted Experiments The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

I knew that VQSR could not be used on small targeted experiments, but didn't know that BQSR should not be used. The Base Quality Score Recalibration page includes the note:

A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.

If I have 150Mbases of data over a set of small target intervals, does that count as a small dataset for which I should avoid using BQSR? What about 1B bases, again over a small set of intervals? What are the best practices in this case?

##### ERROR MESSAGE: SAM/BAM file SAMFileReader{xxxx.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (72) with BAQ correction factor of 8; please see the GATK --help documentation for options related to this error

Hello GATK team,

I have a question regarding the PrintReads walker. I am running it with the --BQSR engine using the command below. While my job has yet to finish, it is clear that the output bam file will be significantly larger than the input (~2x). I have not set the –emit_original_quals flag, so I expect that the original scores should be discarded. Should I still be expecting this size increase?

     java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/applyRecal_${sample}\ -jar${gatk}\
-R ${refSequence}\ -I${SCRATCH}/{sample}.realigned.bam\ -BQSR{SCRATCH}/${sample}.recal_data.table\ -o${SCRATCH}/{sample}.recalibrated.bam\ -nct 4  Thanks so much, Noushin Hello again, More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps. I'm working with whole-genome sequence; however, the known sites have been filtered for the following contigs (example from the SNP vcf): ##fileformat=VCFv4.1 ##samtoolsVersion=0.1.18-r572 ##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa ##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ) ##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus) ##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2) ##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5) ##contig=<ID=1,length=195471971> ##contig=<ID=10,length=130694993> ##contig=<ID=11,length=122082543> ##contig=<ID=12,length=120129022> ##contig=<ID=13,length=120421639> ##contig=<ID=14,length=124902244> ##contig=<ID=15,length=104043685> ##contig=<ID=16,length=98207768> ##contig=<ID=17,length=94987271> ##contig=<ID=18,length=90702639> ##contig=<ID=19,length=61431566> ##contig=<ID=2,length=182113224> ##contig=<ID=3,length=160039680> ##contig=<ID=4,length=156508116> ##contig=<ID=5,length=151834684> ##contig=<ID=6,length=149736546> ##contig=<ID=7,length=145441459> ##contig=<ID=8,length=129401213> ##contig=<ID=9,length=124595110> ##contig=<ID=X,length=171031299> ##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]"> ##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]"> ##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]"> ##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []"> ##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]"> ##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [200]"> ##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]"> ##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]"> ##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]"> ##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]"> ##FILTER=<ID=RefN,Description="Reference base is N []"> ##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]"> ##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]"> ##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">  When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match". However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data. So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data. Thanks very much! Hello, I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers. I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine: INFO 15:12:09,747 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09 INFO 15:12:09,758 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,759 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:12:10,377 SAMDataSourceSAMReaders - Initializing SAMRecords in serial INFO 15:12:10,439 SAMDataSourceSAMReaders - Done initializing BAM readers: total time 0.06 INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] I've checked the indel interval lists for my samples and they do all appear to contain different intervals. However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.5-2-gf57256b): ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed ##### ERROR Please do not post this error to the GATK forum ##### ERROR ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation. ##### ERROR ------------------------------------------------------------------------------------------ Which means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR. My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams. Thanks very much! When I run the BQSR with GATK 2.8.1, I can't check the PG information on output bam file. BWA->Picard dedup->GATK LocalRealign->GATK BQSR Below is the output BAM file after BQSR step. @PG ID:GATK IndelRealigner VN:2.3-9-gdcdccbb CL:knownAlleles=[(RodBinding name=knownAlleles source=/nG/Reference/hgdownload.cse.ucsc.edu/goldenPath/hg19/KNOWNINDEL/Mills_and_1000G_gold_standard.indels.hg19.vcf)] targetIntervals=/nG/Data/2077/vcf1/node1/chrY/Databind/chrY.bam.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates PN:MarkDuplicates VN:1.92(1464) CL:net.sf.picard.sam.MarkDuplicates INPUT=[/nG/Data/2077/step2_makebam/node1-1.bam, /nG/Data/2077/step2_makebam/node1-2.bam, /nG/Data/2077/step2_makebam/node1-3.bam, /nG/Data/2077/step2_makebam/node1-4.bam, /nG/Data/2077/step2_makebam/node1-5.bam, /nG/Data/2077/step2_makebam/node1-6.bam] OUTPUT=/dev/stdout METRICS_FILE=/nG/Data/2077/temp/picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000000 TMP_DIR=[/nG/Data/2077/temp] QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=2000000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO CREATE_INDEX=false CREATE_MD5_FILE=false Related thread (guess) http://gatkforums.broadinstitute.org/discussion/2118/baserecalibration-prinread-don-t-create-a-header-and-don-t-obtain-oq-orignal-base-quality-in-bam Hi, Looking in the forum, I can´t see how to correct this. I am using java 1.7, sorted HG19 from UCSC and sorted HG19 dbnsp from the bundle java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I PAN001N.rmdup.bam -R HG19.fasta -knownSites dbsnp_137.hg19.sorted.vcf -o recalibration_report.grp The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 ##### ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: -37 at org.broadinstitute.sting.utils.BaseUtils.convertIUPACtoN(BaseUtils.java:172) at org.broadinstitute.sting.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:288) at org.broadinstitute.sting.gatk.datasources.providers.ReferenceView.getReferenceBases(ReferenceView.java:116) at org.broadinstitute.sting.gatk.datasources.providers.ReadReferenceViewProvider.getBases(ReadReferenceView.java:87) at org.broadinstitute.sting.gatk.contexts.ReferenceContext.fetchBasesFromProvider(ReferenceContext.java:145) at org.broadinstitute.sting.gatk.contexts.ReferenceContext.getBases(ReferenceContext.java:189) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateIsSNP(BaseRecalibrator.java:335) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:253) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

As a note, I validated the BAM with picard/validateSamFile, with no errors found

Do you have some ideas? Thanks!!

Hello,

I have a question regarding the BQSR param when running the UnifiedGenotyper. If I leave away that param, will there be a "default" confusion matrix applied for (single-sample) SNP calling?

Best, Cindy

I have a study including 100 samples. While most samples were sequenced in one lane, a dozen were in two lanes. I wonder if the difference in # of lanes may lead to difference in overall base quality scores after BQSR?

I have been working primarily with non-model organisms (and mostly inbred-mapping populations, but that's a topic for a different discussion). To recalibrate base qualities, I have taken the approach of running through the Indel Realignment, SNP, and INDEL calling. Then, filtering around INDELs. I use multi-sample VCFs and have taken the following approach to recalibrate base quality: I grab the top 90th percentile SNPs from all SNPs in my filtered SNP VCF file (based on ALTQ), then I pull out these top SNPs for each SAMPLE in the VCF file (in my case I usually have between 100-300 samples) and write to SEPARATE VCF files for each SAMPLE if the GQ > 90 and it's a SNP for that sample. I then use these SAMPLE HQ VCF files for the BQSR tools.

I have a simple python script for this located here

usage: GetHighQualVcfs.py [-h] -i INFILE -o OUTDIR [--ploidy PLOIDY] [--GQ GQ]
[--percentile PERCENTILE]

Split multi-sample VCFs into single sample VCFs of high quality SNPs.

optional arguments:
-h, --help            show this help message and exit
-i INFILE, --infile INFILE
Multi-sample VCF file
-o OUTDIR, --outdir OUTDIR
Directory to output HQ VCF files.
--ploidy PLOIDY       1 for haploid; 2 for diploid
--GQ GQ               Filters out variants with GQ < this limit.
--percentile PERCENTILE
Reduces to variants with ALTQ > this percentile.


When using queue for BQSR scatter/gather parellelism I've been seeing the following:

java.lang.IllegalArgumentException: Table1 188,3 not equal to 189,3
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62) at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)  I'm using gatk version: v2.4-7-g5e89f01 (I can't keep up the pace with you guys). I'm wondering if this is a know issue, and if so, if there is a workaround or a fix in later GATK versions. Cheers, Johan Hello, I am trying to test -nct parameter in printreads (bqsr). My machine has 32 cores (aws cc2.8xlarge) and input is a single BAM. So I tried "-nct 32" first, but it was almost as slow as "-nct 1". Actually any -nct >8 made it slow, so I am wondering if it's my fault or known/expected result. Thanks! I mapped data against the human reference provided in the GATK_b37_bundle resource bundle and I am now trying to run BQSR using the recommended known variant sets from the same resource bundle. Upon including the Mills_and_1000G_gold_standard.indels.b37.vcf known variant set I get the following error: ##### ERROR contig knownSites = MT / 16571 ##### ERROR contig reference = MT / 16569 The header of the Mills_and_1000G_gold_standard.indels.b37.vcf seems to the indicate that the correct 16569 bp MT version is used for the VCF file ##contig=<ID=MT,length=16569,assembly=b37> Why does the BQSR tool think that a different version of MT is used for the Mills_and_1000G_gold_standard.indels.b37.vcf ? Edit: I have the same problem with the 1000G_phase1.indels.b37.vcf from the GATK_b37_bundle. Get the same error and the MT contig seems the be the correct one from the vcf header. Only the dbsnp_137.b37.vcf is accepted by the BQSR tool without complaining about a different MT contig. Hi, I am working on a data set in which (1) multiple individuals were sequenced on the same lane, and (2) the same individuals were sequenced in multiple runs. If I get this right, we would thus want to consider both lane and run as covariates in BQSR. I have two questions related to this: 1) Which elements of the @RG header are considered by BQSR? All given there (e.g. ID, SM, LB, PI, PL)? 2) I am not sure where (in which @RG elemnt) to best incorporate run and lane info? cheers, R I cannot produce BQSR plots, although I can open the grp file with gsa.read.gatkreport. Here's the command: java -Xmx1g -jar$shares/GenomeAnalysisTK-2.3-6-gebbba25/GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -I ./0.reorder.bam \ -R $shares/ftp.broadinstitute.org/bundle/2.3/hg19/ucsc.hg19.fasta \ -knownSites$shares/ftp.broadinstitute.org/bundle/2.3/hg19/dbsnp_137.hg19.vcf \ -BQSR ./0.reorder.bam.recal.grp \ -o ./0.reorder.bam.post_recal.grp \ --plot_pdf_file ./0.reorder.bam.post_recal.grp.pdf \ -L chr1:1-1000 \ -l DEBUG \ --intermediate_csv_file ./0.reorder.bam.post_recal.grp.csv

##### ERROR stack trace java.lang.NullPointerException at org.broadinstitute.sting.utils.Utils.join(Utils.java:286) at org.broadinstitute.sting.utils.recalibration.RecalUtils.writeCSV(RecalUtils.java:450) at org.broadinstitute.sting.utils.recalibration.RecalUtils.generateRecalibrationPlot(RecalUtils.java:394) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.generatePlots(BaseRecalibrator.java:474) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:464) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.executive.Accumulator\$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:97) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

It looks like the csv file is not being produced.

Thanks!

Hi, I would like to perform base quality score recalibration on only the reads that have the "properly aligned" bit (0x2) set in the FLAG column of the SAM format. Ideally, I would like to use the --read_filter argument. Below is some code that does this to my satisfaction with the PrintReads walker of GATK 2 lite. However, GATK 2 lite does not support base quality score recalibration table creation. Is there any way someone could add the code to the GATK 2 full version?

I am not sure why, but the code seems to only work with the System.out.println() line.

Thanks, Winni

/* * code written by Kiran Garimella */

import net.sf.samtools.SAMRecord;

public class ProperPairFilter extends ReadFilter { @Override public boolean filterOut(SAMRecord samRecord) { System.out.println(samRecord.getProperPairFlag()); return !samRecord.getProperPairFlag(); } }

Hello,

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

• MarkDuplicates
• Count Covariates
• Table Recalibration
• Realigner Target Creator
• Indel Realigner

What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

Any help on this would be appreciated.

Mike

Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping via BWA, soritng and merging, my pipeline goes like this:

1. Remove duplicates via MarkDuplicates.
2. RealignerTargetCreator
3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input vcf file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!

Sagi

What are the BD and BI flags that get added to my bam files after base recalibration? They seem to consist of a long string of "N"s, and I'm trying to understand if that is correct.

Thanks!

My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as --knownSites or --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF'. It'd be great to be able to do this on-the-fly in conjunction with--known' in tools that required knownSites rather than having to create project-specific (or even tool-specific) VCFs.

Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?

This method is described to be the "First pass of the base quality score recalibration". What is the second pass? It is not mentioned anywhere, or am I looking in the wrong place? In v1.2 there were two steps, is there only one step now for bqsr? Confused, Juan

I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator walker. According to this post the BaseRecalibrator should output "A PDF file containing quality control plots showing the patterns of recalibration of the data", however I do not have any such file. Both the BaseRecalibrator and PrintReads steps of the BQSR pipeline appear to have worked as I have a recalibrated BAM file and the accompanying GATKReport but I would like to be able to view plots of the recalibration process (and preferably have these generated automatically by the recalibration pipeline).