Tagged with #BQSR
4 documentation articles | 1 announcement | 60 forum discussions

Created 2015-08-26 20:08:46 | Updated 2016-04-30 02:21:28 | Tags: bqsr merge

Comments (1)

It is fairly common to have multiple read groups for a sample, either from sequencing multiple libraries or from spreading a library across multiple lanes. It seems this causes a lot of confusion, and people often tell us they're not sure how to organize the data for the pre-processing steps or how to feed the data into HaplotypeCaller.

Well, there are several options for organizing the processing. We have a fairly detailed FAQ article that describes our preferred workflow for pre-processing data from multiplexed sequencing and multi-library designs. But in this article we describe at a simpler level what are the main two options depending on how you want to provide the analysis ready BAM files to the variant caller.

To produce a combined per-sample bam file to feed to HaplotypeCaller (most common)

The simplest thing to do is to input all the bam files that belong to that sample, either at the MarkDuplicates step, the Indel Realignment step or at the BQSR step. The choice depends mostly on how deep the coverage is. High depth means a lot of data to process at the same time, which slows down Indel Realignment. This is because Indel Realignment ignores all read group information and simply processes all reads together. BQSR doesn't suffer from that problem because it processes read groups separately. In either case, when you input all samples together, the bam that gets written out with the processed data will include all the libraries / read groups in one handy per-sample file.

Note: We do not require the PU field in the RG, however, BQSR will consider the PU field over all other fields.

To produce a separate bam file for each read group (less common)

Another option is to keep all the bam files separate until variant calling, and then input them to Haplotype Caller together. You can do this by simply running Indel Realignment and BQSR on each of the bams separately. You can then input all of the bams into HaplotypeCaller at once. This works even if you want to run HaplotypeCaller in GVCF mode, which can only be done on a single sample at a time. As long as the SM tags are identical, HaplotypeCaller will recognize that it's a single-sample run. This is because the GATK engine will merge the data before presenting it to the HaplotypeCaller tool, so HaplotypeCaller does not know nor care whether the data came from many files or one file.

Note: If you input many bam files into Indel Realigner, the default output is one bam file. However, you can output one bam file for each input bam file by using -nWayOut.

Created 2014-06-12 22:52:09 | Updated 2015-04-26 00:23:41 | Tags: bqsr dependencies rscript analyzecovariates

Comments (19)

When you run AnalyzeCovariates to analyze your BQSR outputs, you may encounter an error starting with this line:

org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.

The main reason why this error often occurs is simple, and so is the solution. The script depends on some external R libraries, so if you don't have them installed, the script fails. To find out what libraries are necessary and how to install them, you can refer to this tutorial.

One other common issue is that the version of ggplot2 you have installed is very recent and is not compatible with the BQSR script. If so, download this Rscript file and use it to generate the plots manually according to the instructions below.

If you have already checked that you have all the necessary libraries installed, you'll need to run the script manually in order to find out what is wrong. To new users, this can seem complicated, but it only takes these 3 simple steps to do it!

1. Re-run AnalyzeCovariates with these additional parameters:

  • -l DEBUG (that's a lowercase L, not an uppercase i, to be clear) and
  • -csv my-report.csv (where you can call the .csv file anything; this is so the intermediate csv file will be saved).

2. Identify the lines in the log output that says what parameters the RScript is given.

The snippet below shows you the components of the R script command line that AnalyzeCovariates uses.

INFO  18:04:55,355 AnalyzeCovariates - Generating plots file 'RTest.pdf' 
DEBUG 18:04:55,672 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/utils/recalibration/BQSR.R /Users/schandra/BQSR_Testing/RTest.csv /Users/schandra/BQSR_Testing/RTest.recal /Users/schandra/BQSR_Testing/RTest.pdf 
DEBUG 18:04:55,687 RScriptExecutor - Executing: 
DEBUG 18:04:55,688 RScriptExecutor -   Rscript 
DEBUG 18:04:55,688 RScriptExecutor -   -e 
DEBUG 18:04:55,688 RScriptExecutor -   tempLibDir = '/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/Rlib.2085451458391709180';source('/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/BQSR.761775214345441497.R'); 
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.csv 
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.recal 
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.pdf 

So, your full command line will be:

RScript BQSR.R RTest.csv RTest.recal RTest.pdf

Please note:

3. Run the script manually with the above arguments.

For new users, the easiest way to do this is to do it from within an IDE program like RStudio. Or, you can start up R at the command line and run it that way, whatever you are comfortable with.

Created 2014-06-05 16:10:25 | Updated 2014-06-05 17:55:23 | Tags: vqsr bqsr phred quality-scores

Comments (2)

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Now they are widely used to represent probabilities and confidence scores in other contexts of genome science.

Phred scale in context

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

Phred scale in practice

In today’s sequencing output, by convention, Phred-scaled base quality scores range from 2 to 63. However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A higher score indicates a higher probability that a particular decision is correct, while conversely, a lower score indicates a higher probability that the decision is incorrect.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$ Q = -10 \log E $$

So we can interpret this score as an estimate of error, where the error is e.g. the probability that the base is called incorrectly by the sequencer, but we can also interpret it as an estimate of accuracy, where the accuracy is e.g. the probability that the base was identified correctly by the sequencer. Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$ E = 10 ^{-\left(\frac{Q}{10}\right)} $$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

$$ \begin{eqnarray} A &=& 1 - E \nonumber \ &=& 1 - 10 ^{-\left(\frac{Q}{10}\right)} \nonumber \ \end{eqnarray} $$

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

Phred Quality Score Error Accuracy (1 - Error)
10 1/10 = 10% 90%
20 1/100 = 1% 99%
30 1/1000 = 0.1% 99.9%
40 1/10000 = 0.01% 99.99%
50 1/100000 = 0.001% 99.999%
60 1/1000000 = 0.0001% 99.9999%

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.

Created 2012-07-23 17:12:33 | Updated 2016-01-15 18:08:32 | Tags: bqsr baserecalibrator printreads baserecalibration

Comments (53)

Detailed information about command line options for BaseRecalibrator can be found here.


The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.

The system was designed for (sophisticated) users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.

Running the tools


Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • read group the read belongs to
  • assigned quality score
  • machine cycle producing this base
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.

Effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

  • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real snps, you should always perform recalibration on your bam file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
  • The recalibrator applies a "yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems, which has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

The output of the BaseRecalibrator

  • A Recalibration report containing all the recalibration information for the data

Note that the BasRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and its values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:

#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.

Example Arguments table:

#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.

ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.

ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.

ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31


The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

Created 2012-09-20 20:43:12 | Updated 2012-09-20 20:43:12 | Tags: bqsr baserecalibrator queue scatter-gather bug

Comments (0)

We have discovered a bug that seriously impacts the results of BQSR/ BaseRecalibrator when it is run with the scatter-gather functionality of Queue. To be clear, the bug does NOT affect BaseRecalibrator runs performed "normally", i.e. WITHOUT Queue's scatter-gather.

Consequences/ Solution:

Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.

This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!

Created 2016-04-20 11:04:13 | Updated | Tags: bqsr mouse enu

Comments (2)


I am currently processing some aligned reads to discover novel variants from an ENU mutagenesis project. All the samples are Bl6 mice, with a few exceptions.

A previous forum has discussed about using BQSR on mice sequences (gatkforums.broadinstitute.org/gatk/discussion/1243/what-are-the-standard-resources-for-non-human-genomes). Moreover, there's a research paper that fed in dbSNP database into their BQSR algorithm for their ENU project (ncbi.nlm.nih.gov/pmc/articles/PMC4623266/.)

From what I understand about BQSR, the recalibration method considers variants not found in well annotated variants as "errors" (is this error modelling?). I'm just wondering for such a project where novel mutations are to be expected, one might overestimate the errors if one were to feed in known variants into the learning algorithm (especially if the mutation load is high)?

Thanks in advance

Created 2016-04-11 04:59:55 | Updated | Tags: bqsr mouse mm10

Comments (1)


I've had good luck running BQSR on E. coli and C. remanei in the past, but mouse is giving me some trouble. It works if I remove the "--fix_misencoded_quality_scores," but I'm not sure if that is very helpful, as I thought that was the point of running BQSR. I have formatted the mm10 reference as suggested, with the chromosomes in the correct order and the chrom labels as just numbers without "chr", but I keep getting the same error. For the known sites vcf, I just use a vcf generated from the same, non-calibrated bam file, generated with the program Lofreq.

Thanks in advance for any help. Here is my input and error message:

$ java -jar /usr/local/packages/GATK/2.6-4/GenomeAnalysisTK.jar -T BaseRecalibrator -R /home13/jpreston/genomes/sorted_mm10/sorted_mm10_2.fa -I /home9/anniep/dnaseq/bowtie/242_normal_sorted.bam -knownSites /home9/anniep/dnaseq/bowtie/242_normal_sorted_KS.vcf --fix_misencoded_quality_scores -o /home9/anniep/dnaseq/bowtie/242_normal_sorted_recal.table
INFO 21:42:19,395 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:42:19,397 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.6-4-g3e5ff60, Compiled 2013/06/24 14:48:56 INFO 21:42:19,397 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 21:42:19,397 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 21:42:19,401 HelpFormatter - Program Args: -T BaseRecalibrator -R /home13/jpreston/genomes/sorted_mm10/sorted_mm10_2.fa -I /home9/anniep/dnaseq/bowtie/242_normal_sorted.bam -knownSites /home9/anniep/dnaseq/bowtie/242_normal_sorted_KS.vcf --fix_misencoded_quality_scores -o /home9/anniep/dnaseq/bowtie/242_normal_sorted_recal.table INFO 21:42:19,401 HelpFormatter - Date/Time: 2016/04/10 21:42:19 INFO 21:42:19,401 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:42:19,402 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:42:19,413 ArgumentTypeDescriptor - Dynamically determined type of /home9/anniep/dnaseq/bowtie/242_normal_sorted_KS.vcf to be VCF INFO 21:42:19,982 GenomeAnalysisEngine - Strictness is SILENT INFO 21:42:20,069 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 21:42:20,076 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 21:42:20,134 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 INFO 21:42:20,168 RMDTrackBuilder - Loading Tribble index from disk for file /home9/anniep/dnaseq/bowtie/242_normal_sorted_KS.vcf WARN 21:42:20,288 RMDTrackBuilder - Index file /home9/anniep/dnaseq/bowtie/242_normal_sorted_KS.vcf.idx is out of date (index older than input file), deleting and updating the index file INFO 21:42:20,394 RMDTrackBuilder - Creating Tribble index in memory for file /home9/anniep/dnaseq/bowtie/242_normal_sorted_KS.vcf INFO 21:42:21,123 RMDTrackBuilder - Writing Tribble index to disk for file /home9/anniep/dnaseq/bowtie/242_normal_sorted_KS.vcf.idx INFO 21:42:29,567 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 21:42:29,571 GenomeAnalysisEngine - Done preparing for traversal INFO 21:42:29,571 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 21:42:29,571 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining INFO 21:42:29,596 BaseRecalibrator - The covariates being used here:
INFO 21:42:29,596 BaseRecalibrator - ReadGroupCovariate INFO 21:42:29,596 BaseRecalibrator - QualityScoreCovariate INFO 21:42:29,596 BaseRecalibrator - ContextCovariate INFO 21:42:29,597 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3 INFO 21:42:29,597 BaseRecalibrator - CycleCovariate INFO 21:42:29,600 ReadShardBalancer$1 - Loading BAM index data INFO 21:42:29,600 ReadShardBalancer$1 - Done loading BAM index data WARN 21:42:30,549 RestStorageService - Error Response: PUT '/GATK_Run_Reports/CDOcM34kcITMDEM7GEsb3sgg7vCkk1ry.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 988, Content-MD5: d3m5ghuffx5ZqbXFYdU8eQ==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 7779b9821b9f7f1e59a9b5c561d53c79, Date: Mon, 11 Apr 2016 04:42:29 GMT, Authorization: AWS AKIAIMHBU7X642TCHQ2A:43FosbVvbsP2X/SqzsJ7PhY1w80=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-358.23.2.el6.x86_64; amd64; en; JVM 1.7.0_80), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: FE8C7B3005828024, x-amz-id-2: euWnPAZr6BuQp05KFLPCkdLpzrcwXMCEWm3Tlfjk2lGbgP89RENyKG387IdN+YcAXsa58zB7zzk=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Mon, 11 Apr 2016 04:42:29 GMT, Connection: close, Server: AmazonS3]

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.6-4-g3e5ff60):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
ERROR ------------------------------------------------------------------------------------------

Created 2016-03-23 10:05:44 | Updated | Tags: bqsr indels quality-score

Comments (11)

Hi the Team,

I am analyzing some small target PCR-based deep sequencing data. The target region is 264067 bp, median of interval average coverage is around 5000x. I did indel realignment and base quality recalibration in preprocess. After BQSR, I found that for almost all reads, the BI quality (round 55) is higher than BD quality (round 45).

In the following variants calling step, we found much more deletions than insertions.

May I know that it is a normal behavior for the data having higher BI quality than BD quality?

Created 2016-01-15 16:27:38 | Updated | Tags: indelrealigner bqsr commandlinegatk knownsites

Comments (2)

Dear GATK team,

I'd like to learn what files I should use for indel realignment and BQSR from hg38 bundle? (I read the manual on this topic -- https://broadinstitute.org/gatk/guide/article?id=1247 -- but just would like to be sure):

1) Am I right that for indel realignment I should use Mills_and_1000G_gold_standard.indels.hg38.vcf and 1000G_phase1.snps.high_confidence.hg38.vcf.gz ?

2) Am I right that for BQSR I should use Mills_and_1000G_gold_standard.indels.hg38.vcf , 1000G_phase1.snps.high_confidence.hg38.vcf.gz , and dbsnp_144.hg38.vcf ?

3) Are there any other files with known sites I should use for indel realignment and BQSR?

Created 2016-01-11 21:13:28 | Updated | Tags: bqsr variant-calling

Comments (1)


I am trying to figure something out in GATK, namely, BQSR. I am working an organism for which there are no known SNPs, so I ran the haplotype caller on uncalibrated data (realigned_reads.bam and raw_variants.vcf). Then I ran BQSR using the raw_variants.vcf file as the knownSites. According to my understanding, since the BQSR would SKIP re-calibrating SNPs from the knownSites file, the qualities for this variant in the before file (realigned_reads.bam) should be the same after re-calibration (recal_reads.bam). But as you can see from the color of the nucleotides, they are indeed lower QS (much lower) after recalibration. This is the source of my confusion. According to the docs, this site should have been skipped. Any clarification would be greatly appreciated.

Thank you.

Created 2015-12-04 14:12:52 | Updated | Tags: bqsr dbsnp

Comments (1)

We are planing to work on GRCh38, and BQSR requires dbSNP VCF as input. The latest one is v146 in GRCh38 here ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b146_GRCh38p2/VCF/

There are two files in dbSNP, one is ALL_20151104.vcf (3.2G) and another is common_all_20151104.vcf (0.9G). Which one should we use? Should we just take the file as is, or do we need some filtering (say MAF, multiallielic SNP, etc)

Created 2015-09-24 07:33:27 | Updated | Tags: bqsr

Comments (1)

Hi! I have a question about the BQSR quality strings, those output as BD:Z, BI:Z and BQ:Z in the final BAM file. I would like to decrease the footprint of our final analysis files. I was thinking to remove those quality strings after the variant calling (since they are required for variant calling). I read in one of your posts that it could be possible to retrieve those quality strings back if one has the recalibration tables by just using printreads option? Could you confirm that this is the case?

Created 2015-09-22 12:51:29 | Updated | Tags: bqsr queue scala

Comments (4)


I am trying to speed up BaseRecalibrator by using GATK Queue. However, it does not seem to play well with scatter-gather. When I run it with the -BQSR option, it fails with an error message which seems to apply to GATK Queue itself and not (at least not directly) to something in my script.

My script is just a wrapper:

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class MyBaseRecalibrator extends QScript {

    var referenceFile: File = _

    var bamFile: File = _

    @Input(shortName="L", required=false)
    var targetIntervals: File = _

    @Argument(shortName="interval_padding", required=false)
    var intervalPadding: Int = _

    var knownSites: List[File] = Nil

    @Input(shortName="BQSR", required=false)
    var bqsrFile: File = _

    var outFile: File = _

    var scatterCount: Int = _

    def script() {

        val br = new BaseRecalibrator

        br.R = referenceFile
        br.I :+= bamFile

        br.L :+= targetIntervals
        br.interval_padding = intervalPadding

        br.BQSR = bqsrFile

        br.knownSites = knownSites

        br.o = outFile

        br.scatterCount = scatterCount
        br.memoryLimit = 8




When the BQSR option is left out, everything works great, but when it is included, the error message is as follows:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$$anonfun$intervalFilesExist$1.apply(GATKScatterFunction.scala:90)
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$$anonfun$intervalFilesExist$1.apply(GATKScatterFunction.scala:90)
    at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
    at scala.collection.immutable.List.exists(List.scala:84)
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$class.intervalFilesExist(GATKScatterFunction.scala:90)
    at org.broadinstitute.gatk.queue.extensions.gatk.ContigScatterFunction.intervalFilesExist(ContigScatterFunction.scala:35)
    at org.broadinstitute.gatk.queue.extensions.gatk.ContigScatterFunction.scatterCount(ContigScatterFunction.scala:37)
    at org.broadinstitute.gatk.queue.function.scattergather.ScatterGatherableFunction$class.generateFunctions(ScatterGatherableFunction.scala:131)
    at org.broadinstitute.gatk.queue.extensions.gatk.BaseRecalibrator.generateFunctions(BaseRecalibrator.scala:10)
    at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$fillGraph$1.apply(QGraph.scala:182)
    at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$fillGraph$1.apply(QGraph.scala:179)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.broadinstitute.gatk.queue.engine.QGraph.fillGraph(QGraph.scala:179)
    at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:140)
    at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:170)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------
INFO  11:26:09,842 QCommandLine - Shutting down jobs. Please wait...

None of the lines seem to refer to my code. Any clues?

Thanks, Michael

Created 2015-09-17 21:56:03 | Updated | Tags: bqsr best-practices bam gatk

Comments (1)


I recently re-generated my realigned bam files (realigned.bam) from the GATK Best Practices, but I encountered an error when I validated the bam files using the validation tool (ValidateSamFile.jar) in Picard:

ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate alignment does not match alignment start of mate ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate negative strand flag does not match read negative strand flag of mate


What exactly is going on? Where in the pipeline could this have occurred? I do not have the intermediary files so I cannot diagnose this accurately. Before I started the GATK Best Practices, my bam files had no problems; when I validated them I did not get any error messages. Does this mean that I have to regenerate the files again and start from the very beginning?

I tried to use the Picard tool FixMateInformation.jar, but then I got an error message regarding the index files. I tried deleting the bam index for one bam file and then recreating it to see if there was a problem with the indexes, but doing so did not resolve the issue. So it seems that something went wrong with the bam files at some earlier step.

Has this error been encountered before? Not sure how to proceed next, except restart everything.

Best, Alva

Created 2015-08-24 09:35:24 | Updated | Tags: unifiedgenotyper vqsr bqsr haplotypecaller

Comments (1)

I'm a bit uncertain as to the optimal pipeline for calling variants. I've sequenced a population sample of ~200 at high coverage ~30X, with no prior information on nucleotide variation.

The most rigorous pipeline would seem to be:

  1. Call variants with UG on 'raw' (realigned) bams.
  2. Extract out high-confidence variants (high QUAL, high DP, not near indels or repeats, high MAF)
  3. Perform BQSR using the high-confidence variants.
  4. Call variants with HaplotypeCaller on recalibrated bams.
  5. Perform VQSR using high-confidence variants.
  6. Any other hard filters.

Is this excessive? Does using HaplotypeCaller negate the use of *QSR? Is it worthwhile performing VQSR if BQSR hasn't been done? Otherwise I'm just running HaplotyperCaller on un-recalibrated bams, and then hard-filtering.

Created 2015-08-17 16:20:21 | Updated 2015-08-17 16:22:55 | Tags: bqsr

Comments (7)

When I have a known rate of 1 SNP every 30 basepairs and my reads (only counting the 'good' ones, used in BQSR) have 1e10 basepairs in total. So, I have 1e10 / 30 = 3.3e8 SNPs on my reads. Can I say that I expect BQSR to find 3.3e8 + epsilon errors when not masking any known SNPs?

This is connected to creating an own set of known SNPs for BQSR. I so far have the feeling from the reports, that the error rate is too high and my set should be bigger. Using the known rate of SNPs in my data as comparison to the Observations/Errors in the report and table from recalibration. So having the optimal set, I would get 3.3e8 + epsilon Observations and epsilon Errors. Am I right?

Created 2015-08-13 14:07:05 | Updated | Tags: bqsr baserecalibrator read-group-effects

Comments (4)

Hi Team,

I have a pooled dataset with 95 individuals on one lane. This I have in 95 files, having each unique readgroups like this:


I ran AddOrReplaceReadgroups on these sets, so I had readgroups like this:


Then I ran BQSR.

  1. All original files together by using multiple times --input_file
  2. All files with modified RG.

In the log I get: INFO 21:01:44,195 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.32 s, 50 of 95 in 0.32 s / 0.01 m (154.11 tasks/s). 45 remaining with est. completion in 0.29 s / 0.00 m

Surprisingly after running the second Recalibration and Report generation like in the best practices, I get the EXACT same results (PDF)! The only thing that is different is the timestamp on the first page ;)

On the page 'Overall error rates by event type' it states the ReadGroup LB_1 for both runs.

Did I miss that BQSR is not RG sensitive anymore, but PU sensitive?

Best, Alexander

Created 2015-07-27 07:48:57 | Updated | Tags: bqsr knownsites

Comments (1)

Hello, I have been using GATK for a while now, but until now I've been analyzing Human samples and the current analysis is of Arabidopsis Thaliana data. Since I do not have databases of known indels and SNPs for this algorithm, I am following the suggested workflow without known sites. I only have 2 samples in the analysis, a W.T parental strain and a sample which consists of a pool of 50 plants that underwent EMS mutagenesis. This treatment causes a large number of mutations, when each of the 50 plants in the pool can present different variations and the goal in the experiment is to find the one strong common homozygous mutation to the mutated plants, which is not present in the parental strain. Since the data is a bit different than any other data that I had worked with, I would like to know if the standard workflow (running indel realignment without known sites, running HC, filtering the high confidence SNPs and than BQSR and HC again) is also recommended in this case and if so, should I apply any different cutoffs to obtain the high confidence SNPs set? Should I use the variants found in both samples to create the high confidence SNPs file? (since the mutagenesis sample will consist of a lot of mutation with a wide range of frequencies, that will not be present in the parental strain at all)

Thank you very much, Olga karinsky

Created 2015-07-22 05:07:12 | Updated | Tags: bqsr baserecalibrator best-practices gatk

Comments (5)


I am running GATK Best Practices again and just wanted to assure myself that I am proceeding correctly. I remember reading somewhere on here that BQSR performs best when given the whole genome to work with (similar to VQSR in that respect). Better to just run using the whole genome rather than giving it single chromosomes. I couldn't really find this comment/recommendation when I tried to look for it again, so wanted to see what the final word was on this. How should one proceed?

Thanks, Alva

Created 2015-07-08 08:42:53 | Updated | Tags: bqsr non-human

Comments (1)

Hello, I would like to call variants on non human genome (S.Cerevisiae). I understood that I should perform These steps:

  1. Indel realignment
  2. HC to have SNPs data base
  3. variant filtration to get the more confident variants
  4. running BQSR with the vriants as data base.
  5. running HC on the BQSR data

my questions are: How should I perform the filtration of the variants in 3? do you have any recommendations? I would like to find variants only for one sample. should I use more samples for creating the variant data base, or one sample can be enough (I have 7 samples of that project, and I can use them)?


Created 2015-07-01 11:21:10 | Updated | Tags: bqsr

Comments (1)


I have run the BQSR but I only included two databases for the -known parameter; 1000G_phase1.indels Mills_and_1000g_gold_standard_indels

I realise I should have included the dbsnp_138 database in this list. Is it necessary to re-run with all three databases? How much difference will it make to the variant calling analysis?

Created 2015-06-26 11:26:50 | Updated | Tags: bqsr knownsites rnaseq

Comments (1)

Hi, i read the entry about how to do BQSR without a knownSNP file and have some uncertainties how to apply it to RNAseq data.

I am actually calling SNPs from RNA-seq data on a draft genome of a non-model organism and were wondering what best practice might be to do so (must sound like a nightmare to you working with human data :smile: ).

I can think of following workflow for each of the RNA-seq samples:

1. best practice SNPcalling for RNA reads with HaplotypeCaller
2. filter variants for "high quality" (-window 25 -cluster 3 --filterExpression "MQ < 30.0" --filterExpression "QD < 2.0"  --filterExpression "DP < 5" 
3. select for PASS SNPs and biallelic SNPs (as sample is diploid)
4. use the selected SNPs as knownSNPs to do BQSR
5. run Haplotypecaller again on the recalibrated bam
6. go nuts with the gained vcf file... =)

Should i include heterozygous SNPs to generate the BQSR-recalibration file? Would you agree on that workflow or alter the filters ( i know filtering for depth is not a good thing to do but for RNA-seq i think its good to have some minimal coverage of a site)

Comments and recommendations are very welcome, Thank you,


Created 2015-06-05 09:10:36 | Updated | Tags: bqsr best-practices

Comments (1)

We are using a licensed version of GATK here. The version is GATK version -2014.3-3.2.2-7-g f9cba99.

While using the tool for analysis of exome data I had few questions.

1: Are there different parameters used in GATK at different stages for whole and targeted exom sequencing data ?

2: What are the guidelines being followed for targeted exome seq analysis if I use BQSR ?

I was reading the page ("https://www.broadinstitute.org/gatk/guide/tagged?tag=exome") and it was mentioned as " The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets."

Request you to kindly clear the doubts.

Created 2015-06-02 11:52:06 | Updated | Tags: bqsr haplotypecaller

Comments (2)

I was wondering if GATK will behave correctly if I use the -BQSR option to HaplotypeCaller instead of PrintReads, thus saving a processing step in my pipeline. AFAIK, HaplotypeCaller is the only part of our pipeline that actually uses the base score values.

Created 2015-05-23 21:49:38 | Updated | Tags: bqsr haplotypecaller best-practices dnaseq

Comments (1)

Hi, I have ~150 WGS all sequenced in batches on Illumina HiSeq over a number of years, on a number of machines/flowcells.

I want to perform BQSR on my BAM files before calling variants. However for my organism I do not have access to a resource like dbSNP for true variants. So I am following the protocol by doing a first round of variant calling, subsetting to the SNPs with highest confidence and using these as the known variants for BQSR.

My question is, should I carry this out on samples individually, i.e. one BAM per sample, on which I use HaplotypeCaller for initial variant calling, then subset to best SNPs, use these for BaseRecalibrator and apply the calibration to the original sample before carrying on with single sample variant finding and then joint genotyping as per best practices....


As I have multiple samples from the same flowcells and lanes, would I gain more information by combining those samples into a multisample BAM first, before carrying out BQSR? I'm a little unsure of how the covariates used by BQSR and actually calculated and whether I can increase the accuracy of my recalibration in this way? Each sample has between 500M and 5 billion nucleotides sequenced.

Many thanks.

Created 2015-05-21 19:53:49 | Updated 2015-05-21 19:54:06 | Tags: bqsr best-practices

Comments (2)


I am new to NGS/GATK & have a paired-end targeted (targeted to 1 region on 1 autosome) illumina sequencing project on roughly 470 samples. I was wondering whether Indel Realignment & BQSR were appropriate for my setting, or what considerations should be taken into account when judging whether these steps are appropriate?

I'm curious because I noticed this text here: Small targeted experiments The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

I was unsure whether my project qualified as a 'small dataset' where BQSR should not be run.

Thanks for any help you may be able to offer!

Created 2015-04-27 06:58:38 | Updated | Tags: indelrealigner bqsr

Comments (1)


I have gone through the Realignment step and found Re aligner will change the CIGAR of alignment in bam file. Most of the structaral variant detection tool dependent upon CIGAR field. So my question is it right to consider re calibrated bam file, does it has any advantage for SV Detection over Raw Sorted Bam file..?

Created 2015-03-17 08:53:24 | Updated | Tags: bqsr error

Comments (3)

I used the following script:

java -Xmx5g -jar $gatk -T BaseRecalibrator -R $ref_dir/ucsc.hg19.fasta -I $bam_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.dedupped.realigned.bam -knownSites $ref_dir/dbsnp_138.hg19.vcf -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites $ref_dir/1000G_phase1.indels.hg19.sites.vcf -o $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recal.grp

java -Xmx5g -jar $gatk -T BaseRecalibrator -R $ref_dir/ucsc.hg19.fasta -I $bam_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.dedupped.realigned.bam -BQSR $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recal.grp -o $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.post.recal.grp -knownSites $ref_dir/dbsnp_138.hg19.vcf -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites $ref_dir/1000G_phase1.indels.hg19.sites.vcf

java -Xmx5g -jar $gatk -T AnalyzeCovariates -R $ref_dir/ucsc.hg19.fasta -before $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recal.grp -after $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.post.recal.grp -plots $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recalibration_plots.pdf

java -Xmx5g -jar $gatk -T PrintReads -R $ref_dir/ucsc.hg19.fasta -I $bam_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.dedupped.realigned.bam -BQSR $other_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.recal.grp -o $bam_dir/T-SZ-03-1_NHDE0249-11_C3AP0ACXX_L3.dedupped.realigned.recal.bam

The I got the following error message for the first step:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: java.io.IOException: Input/output error at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:87) at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:74) at htsjdk.samtools.util.AbstractIterator.next(AbstractIterator.java:57) at htsjdk.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:47) at htsjdk.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:25) at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:79) at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:41) at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40) at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:478) at htsjdk.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:440) at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:338) at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:312) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:402) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:288) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:225) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:148) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:997) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:779) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:290) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.io.IOException: Input/output error at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:224) at htsjdk.tribble.readers.PositionalBufferedStream.fill(PositionalBufferedStream.java:127) at htsjdk.tribble.readers.PositionalBufferedStream.peek(PositionalBufferedStream.java:118) at htsjdk.tribble.readers.PositionalBufferedStream.read(PositionalBufferedStream.java:57) at htsjdk.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:80) at htsjdk.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:122) at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:85) ... 24 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: java.io.IOException: Input/output error

Created 2015-02-16 15:13:40 | Updated | Tags: bqsr known-sites

Comments (2)


I am working on non-human data and want to use GATK's Haplotype caller to call variants. I am using the protocol that is listed here.

So far I have reached the indel realignment stage and want to do base recalibration next. Since I do not have known sites, this page in the GATK forum advises you to do an initial round of SNP calling on original un-recalibrated data.

Has anybody experience doing this? Is this initial calling done with UnifiedGenotyper on the indel realigned BAMs, or since I want to use Haploytype Caller downstream, should I use this module instead?

Thanks for your input!

Cheers, Mika

Created 2015-02-16 08:45:57 | Updated | Tags: indelrealigner bqsr

Comments (1)

If I am not having information about known variants. Is it fine if I skip the BQSR step after indelRealignment step??

Created 2015-02-13 08:25:46 | Updated 2015-02-13 08:26:24 | Tags: bqsr printreads performance material

Comments (1)

Hi, Here I have no problem :)

I ran a bqsr workflow on 2 set of data that were almost similare (same study, 2 indiv.), on 2 different machines.

Here are my observations : set1 : Dell powerEdge R820, with 128Go Ram, printReads ran in 7h hours with ~35Go ram used. set2 : Dell powerEdge R730 with 193Go ram (and many jobs already running) printReads ran in 1h20 min with ~35Go ram used.

Both sets were launched with the exact same command : -nct 20 set2 output is a little bigger (16Go vs 14Go for set1).

I'd like to know if this huge difference in duration is mostly due to materials, and if it is, which part ? CPU speed, cache, ram access ?


Created 2015-01-19 12:28:18 | Updated | Tags: bqsr

Comments (4)


I am performing BQSR on RNA-seq data for the purpose of SNP calling. I was wondering about some issues:

  1. My organism does not have a known set of SNPs. I asked this question before in the forum and accordingly, I am using a set of SNPs filtered as the input of knownsites for BQSR. In filtering, some SNPs have a tag 'SNPcluster', shall I use this file or shall I somehow filter the file so only the PASS SNPs are retained?

  2. I have performed SNP calling only on a subset of my bam file because I was interested in certain chromosomes. Would that be fine if I only use this subset of SNPs and try to recalibrate only the subset of bam file not the entire bam? I am asking this question in case this can somehow create a bias for final results.

Thanks in advance for your help!

Created 2015-01-09 16:47:35 | Updated | Tags: bqsr haplotypecaller variantfiltration vcf bam workflows snps gatk3

Comments (3)

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

  • 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
  • 02- BAM: MarkDuplicates
  • 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
  • 04- BAM: Calling variants using HaplotypeCaller
  • 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
  • 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
  • 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
  • 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Created 2014-12-05 21:10:20 | Updated | Tags: bqsr best-practices gatk

Comments (2)


I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option)

Thanks, Alva

Created 2014-12-05 20:16:37 | Updated | Tags: bqsr gatk

Comments (0)


I encounter the "RScript exited with 1. Run with -l DEBUG for more info" error when trying to make recalibration plots using Rscript. However, I do have all the required packages installed on my computer, including gsalib, which seems to be the problem. When I run with -l DEBUG, I get the following information:

Error in library(gsalib) : there is no package called ‘gsalib’

The thing is that I am using my computer but working in a cluster, so not directly on my computer. Should I have the packages installed in the cluster or is it fine that I've installed the packages on my computer? I'm a bit new to this, so I apologize if this question is a bit basic.

Thanks, Alva

Created 2014-11-25 19:25:43 | Updated 2014-11-25 19:27:57 | Tags: bqsr baserecalibrator

Comments (100)

This discussion was created from comments split from: Base Quality Score Recalibration (BQSR). To ask new questions, please post on the article linked above.

Created 2014-11-18 08:41:46 | Updated | Tags: bqsr contextcovariate

Comments (6)

hi, What does the covariate "ContextCovariate" in BQSR mean? Is it "reference sequence context" or "read context"? Is it the same with "Dinuc" in older versions of gatk? Thanks a lot!

Created 2014-10-16 16:06:17 | Updated 2014-10-16 16:09:27 | Tags: bqsr

Comments (2)

Hi, I am using GATK version 3.2-2-gec30cee. I have a pb with BQSR. Whwn I run the first step (analyze covariates), everything seems to be fine according to the INFO given by the walker: Command used: java -jar -Xmx16g $GATK -T BaseRecalibrator -R ./assembly/spades_reapr.genome.abv500.fna -I BAMsort-RG-dedup-realign_Miseq300.bam -knownSites CEW1_Miseq300_HC0.vcf -o CEW1_Miseq300_BQSR1.table INFOs read on shell:

INFO 16:11:41,114 BaseRecalibrator - ...done!
INFO 16:11:41,115 BaseRecalibrator - BaseRecalibrator was able to recalibrate 33940210 reads
INFO 16:11:41,116 ProgressMeter - done 3.3941981E7 86.9 m 2.6 m 100.0% 86.9 m 0.0 s
INFO 16:11:41,117 ProgressMeter - Total runtime 5214.59 secs, 86.91 min, 1.45 hours
INFO 16:11:41,117 MicroScheduler - 2144047 reads were filtered out during the traversal out of approximately 36086028 total reads (5.94%)
INFO 16:11:41,118 MicroScheduler - -> 1567207 reads (4.34% of total) failing DuplicateReadFilter
INFO 16:11:41,118 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 16:11:41,119 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 16:11:41,119 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 16:11:41,120 MicroScheduler - -> 546586 reads (1.51% of total) failing MappingQualityZeroFilter
INFO 16:11:41,120 MicroScheduler - -> 30254 reads (0.08% of total) failing NotPrimaryAlignmentFilter
INFO 16:11:41,121 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 16:11:42,763 GATKRunReport - Uploaded run statistics report to AWS S

But the generated table (CEW1_Miseq300_BQSR1.table) is empty: it is a blank file, as if something went wrong in the previous step...

As expected, when I go to calculate remaining covariates afer plot: java -jar -Xmx16g $GATK -T BaseRecalibrator -R ./assembly/spades_reapr.genome.abv500.fna -I BAMsort-RG-dedup-realign_Miseq300.bam -knownSites CEW1_Miseq300_HC0.vcf -BQSR CEW1_Miseq300_BQSR1.table -o CEW1_Miseq300_post-BQSR1.table I get the following error message:

ERROR A USER ERROR has occurred (version 3.2-2-gec30cee):
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR MESSAGE: Bad input: The GATK report has no version specified in the header
ERROR ------------------------------------------------------------------------------------------

Do you know what's happening? Could you help fix this?

Many thanks!


Created 2014-09-02 12:38:46 | Updated 2014-09-02 12:48:19 | Tags: vqsr bqsr gatk

Comments (1)

Hi there, I have been using GATK to identify variants recently. I saw that BQSR is highly recommended. But I don’t know whether it is still needed for de novo mutation calling. For example, I want to identify de novo mutations generated in the progenies by single seed descent METHODS in plants. For example, in the paper of “The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana”, these spontaneous arising mutations may not included in the known sites of variants. Based on documentation posted in GATK websites, they assume that all reference mismatches we see are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed in the step of variant calling. So in this situation, what should I do? Or should I skip the BQSR step? Also what should I do when I reach to step- VQSR? Hope some GATK developers can help me on this. Thanks.

Created 2014-08-01 11:05:01 | Updated | Tags: bqsr baserecalibrator

Comments (3)

I am recalibrating the base qualities, using all the standard covariates, including machine cycle. The reads come from a 454 machine, and they have an average length of around 300 bases. Curiously, the recalibrator only issues an error if I set the --maximum_cycle_value to less than 27. And even if I set the --maximum_cycle_value to the true maximum length (1601), the RecalTable2 does not record cycles longer than 27. In DEBUG level, I get a few messages that seem to be related, but that I don't know how to interpret:

DEBUG 11:45:23,986 ReadCovariates - Keys cache miss for length 39 cache size 1 DEBUG 11:45:23,988 ReadCovariates - Keys cache miss for length 38 cache size 2 DEBUG 11:45:23,995 ReadCovariates - Keys cache miss for length 41 cache size 3 DEBUG 11:45:23,999 ReadCovariates - Keys cache miss for length 36 cache size 4 DEBUG 11:45:24,001 ReadCovariates - Keys cache miss for length 31 cache size 5 DEBUG 11:45:24,004 ReadCovariates - Keys cache miss for length 43 cache size 6 DEBUG 11:45:24,010 ReadCovariates - Keys cache miss for length 26 cache size 7 DEBUG 11:45:24,015 ReadCovariates - Keys cache miss for length 40 cache size 8 DEBUG 11:45:24,026 ReadCovariates - Keys cache miss for length 30 cache size 9 DEBUG 11:45:24,037 ReadCovariates - Keys cache miss for length 37 cache size 10 DEBUG 11:45:24,079 ReadCovariates - Keys cache miss for length 33 cache size 11 DEBUG 11:45:24,086 ReadCovariates - Keys cache miss for length 44 cache size 12 DEBUG 11:45:24,152 ReadCovariates - Keys cache miss for length 35 cache size 13 DEBUG 11:45:24,606 ReadCovariates - Keys cache miss for length 32 cache size 14 DEBUG 11:45:24,654 ReadCovariates - Keys cache miss for length 28 cache size 15 DEBUG 11:45:24,683 ReadCovariates - Keys cache miss for length 34 cache size 16 DEBUG 11:45:24,716 ReadCovariates - Keys cache miss for length 45 cache size 17 DEBUG 11:45:24,963 ReadCovariates - Keys cache miss for length 27 cache size 18 DEBUG 11:45:25,033 ReadCovariates - Keys cache miss for length 29 cache size 19 DEBUG 11:45:27,401 ReadCovariates - Keys cache miss for length 25 cache size 20 DEBUG 11:45:28,445 ReadCovariates - Keys cache miss for length 46 cache size 21 DEBUG 11:45:29,426 ReadCovariates - Keys cache miss for length 23 cache size 22 DEBUG 11:45:31,228 ReadCovariates - Keys cache miss for length 47 cache size 23

The fact is that after recalibration the low-qualitity tails get higher qualities, and I suspect that the recalibration does not work well for late cycles. I'm using GATK v3.2-2-gec30cee. This is what my command looks like:

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I realigned.bam -R ref.fa -o recal_data.table --maximum_cycle_value 1601 --run_without_dbsnp_potentially_ruining_quality -l DEBUG

One particularity of the bam file is that it's not huge. There are only about 500000 reads. Recalibration may not be accurate, but still, I wonder if it is using all the information available. There are hundreds of thousands of reads longer than 100 bases. Can anybody explain what's happening? Thanks.

Created 2014-07-30 15:41:15 | Updated | Tags: bqsr

Comments (4)

Hi! I’m trying to plan my GATK pipeline and have a few questions. We currently have an assembled genome for a non-model avian species and would like to align the reads from the single individual to the genome and subsequently mine SNPs. Our end game is to design a Fluidigm SNP chip. Since we are only working with the data from a single individual, I am trying to figure out how to best minimize false positive SNP calls.

1) Prior to beginning mapping with BWA, is any pre-processing of the reads recommended? The tutorials say to use raw reads… do we not even need to remove adaptors?

2) Does GATK have any specific recommendations for non-model BQSR (i.e., when no reference set of high-quality SNPs is available)? I’ve found a couple references to this process (http://gatkforums.broadinstitute.org/discussion/3286/quality-score-recalibration-for-non-model-organisms, https://gist.github.com/brantfaircloth/4315737). One user indicated they were using SNPs with quality scores greater than 30 as their reference SNPs for BQSR, another 90. Does GATK have any other thoughts, or is this largely a judgment call?

Created 2014-06-27 21:39:44 | Updated | Tags: bqsr sensitivity hypermutation

Comments (2)

From the GATK FAQs - "The base quality score recalibrator treats every reference mismatch as indicative of machine error"... which is why we have to give it a list of dbSNPs to skip over. But what about somatic SNPs? Wouldn't hypermutated tumors from uterine, colorectal, melanoma, or lung cancers be re-calibrated to a lower quality than data from AML or breast. And variant caller sensitivity would drop accordingly.

Should we ideally do some high-confidence SNP calling on un-calibrated BAMs, and append those to the dbSNP VCF for the BQSR?

Thanks, Cyriac

Created 2014-06-27 12:14:16 | Updated | Tags: bqsr exome multiplexing

Comments (5)

Dear Geraldine and GATK experts,

I have attended the great Brussels workshop, and I am posting here the BQSR question I had.

I have a human exome experiment with 24-multiplexed samples per lane (Nextera libraries) where we first only did one lane of sequencing (~15x) and then added a second lane (summing up to ~30x). From what I understood reading the Best Practices, I probably don't have enough data to run a BQSR on each sample. How should I then do the BQSR step? Should I skip it altogether? Could I estimate the model parameters on one whole lane (all the samples) and then apply it separately to each sample?

And as a separate question: If I could turn back the clock, would it have been better to do 12-multiplexed samples per lane and run these two lanes of sequencing (24 samples in total) for the same amount of reads but giving me more data to do a BQSR step per sample?

Thanks a lot for your help!

Created 2014-06-10 17:25:23 | Updated | Tags: bqsr baserecalibrator

Comments (11)

I know the best practices document is on its way, but I recently came across this quite new page on how to use the -L argument as was surprised to see that:

Small Targeted Experiments The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

I knew that VQSR could not be used on small targeted experiments, but didn't know that BQSR should not be used. The Base Quality Score Recalibration page includes the note:

A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.

If I have 150Mbases of data over a set of small target intervals, does that count as a small dataset for which I should avoid using BQSR? What about 1B bases, again over a small set of intervals? What are the best practices in this case?

Thanks for your help!

Created 2014-06-10 16:11:32 | Updated | Tags: bqsr baserecalibrator

Comments (7)

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR MESSAGE: SAM/BAM file SAMFileReader{xxxx.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (72) with BAQ correction factor of 8; please see the GATK --help documentation for options related to this error

Created 2014-05-14 15:17:29 | Updated 2014-05-14 15:19:08 | Tags: bqsr baserecalibrator printreads

Comments (1)

Hello GATK team,

I have a question regarding the PrintReads walker. I am running it with the --BQSR engine using the command below. While my job has yet to finish, it is clear that the output bam file will be significantly larger than the input (~2x). I have not set the –emit_original_quals flag, so I expect that the original scores should be discarded. Should I still be expecting this size increase?

     java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/applyRecal_${sample}\
     -jar ${gatk}\
     -T PrintReads\
     -R ${refSequence}\
     -I ${SCRATCH}/${sample}.realigned.bam\
     -BQSR ${SCRATCH}/${sample}.recal_data.table\
     -o ${SCRATCH}/${sample}.recalibrated.bam\
     -nct 4

Thanks so much,


Created 2014-04-17 11:34:39 | Updated 2014-04-17 16:09:45 | Tags: indelrealigner bqsr knownsites mouse

Comments (5)

Hello again,

More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps.

I'm working with whole-genome sequence; however, the known sites have been filtered for the following contigs (example from the SNP vcf):

##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ)
##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus)
##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2)
##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5)
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [200]">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
##FILTER=<ID=RefN,Description="Reference base is N []">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">

When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".

However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.

So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.

Thanks very much!

Created 2014-04-10 16:24:48 | Updated | Tags: indelrealigner realignertargetcreator bqsr knownsites mouse indel-realignment

Comments (5)


I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

INFO 15:12:09,747 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09 INFO 15:12:09,758 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,759 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

`##### ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
ERROR ------------------------------------------------------------------------------------------`

Which means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!

Created 2014-03-05 08:09:15 | Updated | Tags: bqsr bam header

Comments (7)

When I run the BQSR with GATK 2.8.1, I can't check the PG information on output bam file.

BWA->Picard dedup->GATK LocalRealign->GATK BQSR

Below is the output BAM file after BQSR step.

@PG ID:GATK IndelRealigner VN:2.3-9-gdcdccbb CL:knownAlleles=[(RodBinding name=knownAlleles source=/nG/Reference/hgdownload.cse.ucsc.edu/goldenPath/hg19/KNOWNINDEL/Mills_and_1000G_gold_standard.indels.hg19.vcf)] targetIntervals=/nG/Data/2077/vcf1/node1/chrY/Databind/chrY.bam.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates PN:MarkDuplicates VN:1.92(1464) CL:net.sf.picard.sam.MarkDuplicates INPUT=[/nG/Data/2077/step2_makebam/node1-1.bam, /nG/Data/2077/step2_makebam/node1-2.bam, /nG/Data/2077/step2_makebam/node1-3.bam, /nG/Data/2077/step2_makebam/node1-4.bam, /nG/Data/2077/step2_makebam/node1-5.bam, /nG/Data/2077/step2_makebam/node1-6.bam] OUTPUT=/dev/stdout METRICS_FILE=/nG/Data/2077/temp/picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000000 TMP_DIR=[/nG/Data/2077/temp] QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=2000000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO CREATE_INDEX=false CREATE_MD5_FILE=false

Related thread (guess) http://gatkforums.broadinstitute.org/discussion/2118/baserecalibration-prinread-don-t-create-a-header-and-don-t-obtain-oq-orignal-base-quality-in-bam

Created 2013-12-06 15:21:40 | Updated | Tags: bqsr

Comments (2)

Hi, Looking in the forum, I can´t see how to correct this. I am using java 1.7, sorted HG19 from UCSC and sorted HG19 dbnsp from the bundle

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I PAN001N.rmdup.bam -R HG19.fasta -knownSites dbsnp_137.hg19.sorted.vcf -o recalibration_report.grp

The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51

ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -37 at org.broadinstitute.sting.utils.BaseUtils.convertIUPACtoN(BaseUtils.java:172) at org.broadinstitute.sting.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:288) at org.broadinstitute.sting.gatk.datasources.providers.ReferenceView.getReferenceBases(ReferenceView.java:116) at org.broadinstitute.sting.gatk.datasources.providers.ReadReferenceView$Provider.getBases(ReadReferenceView.java:87) at org.broadinstitute.sting.gatk.contexts.ReferenceContext.fetchBasesFromProvider(ReferenceContext.java:145) at org.broadinstitute.sting.gatk.contexts.ReferenceContext.getBases(ReferenceContext.java:189) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateIsSNP(BaseRecalibrator.java:335) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:253) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

As a note, I validated the BAM with picard/validateSamFile, with no errors found

Do you have some ideas? Thanks!!

Created 2013-11-07 14:25:18 | Updated | Tags: unifiedgenotyper bqsr computeconfusionmatrix

Comments (5)


I have a question regarding the BQSR param when running the UnifiedGenotyper. If I leave away that param, will there be a "default" confusion matrix applied for (single-sample) SNP calling?

Best, Cindy

Created 2013-11-01 17:56:45 | Updated | Tags: bqsr baserecalibrator

Comments (1)

I have a study including 100 samples. While most samples were sequenced in one lane, a dozen were in two lanes. I wonder if the difference in # of lanes may lead to difference in overall base quality scores after BQSR?

Created 2013-09-29 17:07:55 | Updated | Tags: bqsr

Comments (2)

I have been working primarily with non-model organisms (and mostly inbred-mapping populations, but that's a topic for a different discussion). To recalibrate base qualities, I have taken the approach of running through the Indel Realignment, SNP, and INDEL calling. Then, filtering around INDELs. I use multi-sample VCFs and have taken the following approach to recalibrate base quality: I grab the top 90th percentile SNPs from all SNPs in my filtered SNP VCF file (based on ALTQ), then I pull out these top SNPs for each SAMPLE in the VCF file (in my case I usually have between 100-300 samples) and write to SEPARATE VCF files for each SAMPLE if the GQ > 90 and it's a SNP for that sample. I then use these SAMPLE HQ VCF files for the BQSR tools.

I have a simple python script for this located here

usage: GetHighQualVcfs.py [-h] -i INFILE -o OUTDIR [--ploidy PLOIDY] [--GQ GQ]
                          [--percentile PERCENTILE]

Split multi-sample VCFs into single sample VCFs of high quality SNPs.

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Multi-sample VCF file
  -o OUTDIR, --outdir OUTDIR
                        Directory to output HQ VCF files.
  --ploidy PLOIDY       1 for haploid; 2 for diploid
  --GQ GQ               Filters out variants with GQ < this limit.
  --percentile PERCENTILE
                        Reduces to variants with ALTQ > this percentile.

Thoughts? Concerns? Perhaps I'm going about this in a completely wrong way?

Created 2013-09-23 07:53:40 | Updated | Tags: bqsr queue parallelism

Comments (27)

When using queue for BQSR scatter/gather parellelism I've been seeing the following:

java.lang.IllegalArgumentException: Table1 188,3 not equal to 189,3
        at org.broadinstitute.sting.utils.recalibration.RecalUtils.combineTables(RecalUtils.java:808)
        at org.broadinstitute.sting.utils.recalibration.RecalibrationReport.combine(RecalibrationReport.java:147)
        at org.broadinstitute.sting.gatk.walkers.bqsr.BQSRGatherer.gather(BQSRGatherer.java:86)
        at org.broadinstitute.sting.queue.function.scattergather.GathererFunction.run(GathererFunction.scala:42)
        at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:53)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

I'm using gatk version: v2.4-7-g5e89f01 (I can't keep up the pace with you guys). I'm wondering if this is a know issue, and if so, if there is a workaround or a fix in later GATK versions.

Cheers, Johan

Created 2013-09-09 17:34:44 | Updated | Tags: bqsr printreads parallelism multithreading

Comments (1)


I am trying to test -nct parameter in printreads (bqsr). My machine has 32 cores (aws cc2.8xlarge) and input is a single BAM. So I tried "-nct 32" first, but it was almost as slow as "-nct 1". Actually any -nct >8 made it slow, so I am wondering if it's my fault or known/expected result. Thanks!

Created 2013-08-19 08:49:20 | Updated 2013-08-19 08:59:23 | Tags: bqsr knownsites

Comments (2)

I mapped data against the human reference provided in the GATK_b37_bundle resource bundle and I am now trying to run BQSR using the recommended known variant sets from the same resource bundle.

Upon including the Mills_and_1000G_gold_standard.indels.b37.vcf known variant set I get the following error:

##### ERROR contig knownSites = MT / 16571 ##### ERROR contig reference = MT / 16569

The header of the Mills_and_1000G_gold_standard.indels.b37.vcf seems to the indicate that the correct 16569 bp MT version is used for the VCF file


Why does the BQSR tool think that a different version of MT is used for the Mills_and_1000G_gold_standard.indels.b37.vcf ?


I have the same problem with the 1000G_phase1.indels.b37.vcf from the GATK_b37_bundle. Get the same error and the MT contig seems the be the correct one from the vcf header. Only the dbsnp_137.b37.vcf is accepted by the BQSR tool without complaining about a different MT contig.

Created 2013-03-22 12:52:46 | Updated | Tags: bqsr

Comments (5)

Hi, I am working on a data set in which (1) multiple individuals were sequenced on the same lane, and (2) the same individuals were sequenced in multiple runs. If I get this right, we would thus want to consider both lane and run as covariates in BQSR. I have two questions related to this: 1) Which elements of the @RG header are considered by BQSR? All given there (e.g. ID, SM, LB, PI, PL)? 2) I am not sure where (in which @RG elemnt) to best incorporate run and lane info?

cheers, R

Created 2013-01-24 20:48:12 | Updated | Tags: bqsr baserecalibrator

Comments (9)

I cannot produce BQSR plots, although I can open the grp file with gsa.read.gatkreport.

Here's the command:

java -Xmx1g -jar $shares/GenomeAnalysisTK-2.3-6-gebbba25/GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -I ./0.reorder.bam \ -R $shares/ftp.broadinstitute.org/bundle/2.3/hg19/ucsc.hg19.fasta \ -knownSites $shares/ftp.broadinstitute.org/bundle/2.3/hg19/dbsnp_137.hg19.vcf \ -BQSR ./0.reorder.bam.recal.grp \ -o ./0.reorder.bam.post_recal.grp \ --plot_pdf_file ./0.reorder.bam.post_recal.grp.pdf \ -L chr1:1-1000 \ -l DEBUG \ --intermediate_csv_file ./0.reorder.bam.post_recal.grp.csv

##### ERROR stack trace java.lang.NullPointerException at org.broadinstitute.sting.utils.Utils.join(Utils.java:286) at org.broadinstitute.sting.utils.recalibration.RecalUtils.writeCSV(RecalUtils.java:450) at org.broadinstitute.sting.utils.recalibration.RecalUtils.generateRecalibrationPlot(RecalUtils.java:394) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.generatePlots(BaseRecalibrator.java:474) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:464) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:97) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

It looks like the csv file is not being produced.


Created 2013-01-17 18:20:24 | Updated | Tags: bqsr filter

Comments (3)

Hi, I would like to perform base quality score recalibration on only the reads that have the "properly aligned" bit (0x2) set in the FLAG column of the SAM format. Ideally, I would like to use the --read_filter argument. Below is some code that does this to my satisfaction with the PrintReads walker of GATK 2 lite. However, GATK 2 lite does not support base quality score recalibration table creation. Is there any way someone could add the code to the GATK 2 full version?

I am not sure why, but the code seems to only work with the System.out.println() line.

Thanks, Winni


  • code written by Kiran Garimella */

package org.broadinstitute.sting.gatk.filters;

import net.sf.samtools.SAMRecord;

public class ProperPairFilter extends ReadFilter { @Override public boolean filterOut(SAMRecord samRecord) { System.out.println(samRecord.getProperPairFlag()); return !samRecord.getProperPairFlag(); } }

Created 2012-11-12 22:55:36 | Updated 2013-01-07 20:06:43 | Tags: indelrealigner bqsr best-practices baserecalibration

Comments (1)


I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

  • MarkDuplicates
  • Count Covariates
  • Table Recalibration
  • Realigner Target Creator
  • Indel Realigner

What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

Any help on this would be appreciated.


Created 2012-10-23 06:21:29 | Updated 2012-10-25 15:27:41 | Tags: bqsr best-practices known special-cases

Comments (45)

Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping via BWA, soritng and merging, my pipeline goes like this:

  1. Remove duplicates via MarkDuplicates.
  2. RealignerTargetCreator
  3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
  4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input vcf file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!


Created 2012-10-06 18:59:15 | Updated 2013-01-07 19:49:26 | Tags: bqsr baserecalibrator printreads

Comments (9)

What are the BD and BI flags that get added to my bam files after base recalibration? They seem to consist of a long string of "N"s, and I'm trying to understand if that is correct.


Created 2012-09-09 02:52:32 | Updated 2012-09-09 02:53:52 | Tags: indelrealigner bqsr selectvariants

Comments (2)

My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as --knownSites or --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF'. It'd be great to be able to do this on-the-fly in conjunction with--known' in tools that required knownSites rather than having to create project-specific (or even tool-specific) VCFs.

Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?

Created 2012-08-13 18:49:39 | Updated 2012-10-03 19:30:14 | Tags: bqsr baserecalibrator

Comments (58)

This method is described to be the "First pass of the base quality score recalibration". What is the second pass? It is not mentioned anywhere, or am I looking in the wrong place? In v1.2 there were two steps, is there only one step now for bqsr? Confused, Juan

Created 2012-08-10 01:08:49 | Updated 2012-08-10 01:13:06 | Tags: bqsr baserecalibrator gatk2

Comments (84)

I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator walker. According to this post the BaseRecalibrator should output "A PDF file containing quality control plots showing the patterns of recalibration of the data", however I do not have any such file. Both the BaseRecalibrator and PrintReads steps of the BQSR pipeline appear to have worked as I have a recalibrated BAM file and the accompanying GATKReport but I would like to be able to view plots of the recalibration process (and preferably have these generated automatically by the recalibration pipeline).

Can you please help? Thanks