


Created 2012-07-23 17:12:33 | Updated 2016-01-15 18:08:32 | Tags: bqsr baserecalibrator printreads baserecalibration


Detailed information about command line options for BaseRecalibrator can be found here.

Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular, these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithm. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the end of the reads than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre- and post-correction values.
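
The arithmetic behind this example is plain Phred scaling. A minimal illustration (not GATK code) of why a 1-in-100 mismatch rate corresponds to Q20:

```python
import math

def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10.0)

def empirical_quality(mismatches, observations):
    """Phred-scaled quality implied by an observed mismatch rate."""
    return -10.0 * math.log10(mismatches / observations)

# A reported Q25 base claims an error rate of about 0.3%...
reported_prob = phred_to_prob(25)

# ...but if such bases actually mismatch the reference 1 time in 100,
# their empirical quality is only Q20.
actual_q = empirical_quality(1, 100)
```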

The system was designed so that (sophisticated) users can easily add new covariates to the calculations. If you wish to add your own covariate, simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator


This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • read group the read belongs to
  • assigned quality score
  • machine cycle producing this base
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
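
As a rough sketch of that tabulation (hypothetical data structures, not the actual GATK implementation), the counting amounts to:

```python
from collections import defaultdict

# Hypothetical sketch of the tabulation step. `reads` yields objects with
# .read_group, .positions (aligned reference positions), .bases and .quals;
# `reference` maps position -> base; `known_sites` is the set of positions
# known to vary in the population (e.g. from dbSNP).
def tabulate(reads, reference, known_sites):
    counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for read in reads:
        for i, (pos, base, qual) in enumerate(zip(read.positions, read.bases, read.quals)):
            if pos in known_sites:        # exclude loci known to vary
                continue
            dinuc = (read.bases[i - 1] if i > 0 else "N") + base
            key = (read.read_group, qual, i, dinuc)  # cycle ~ position in read
            counts[key][0] += 1
            if base != reference[pos]:
                counts[key][1] += 1
    return counts
```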

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:

 
java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

After computing covariates in the initial BAM file, the engine walks through the BAM file again, using the recalibration table data in recalibration_report.grp produced by BaseRecalibrator to recalibrate the quality scores (in the QUAL field) of input.bam, and writes out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effects

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
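
A toy illustration of that additive combination, with made-up shift values:

```python
def recalibrate(reported_q, global_shift, qbin_shift, cycle_shift, dinuc_shift):
    """Hypothetical combination of the BQSR corrections listed above: the
    recalibrated quality is the reported quality plus the global
    (reported-vs-empirical) shift, the per-quality-bin shift, and the
    cycle and dinucleotide effects."""
    return reported_q + global_shift + qbin_shift + cycle_shift + dinuc_shift

# e.g. a reported Q25 base in a run that is globally 2 Phred units
# overconfident, in a bin 1 unit worse still, at a cycle costing 0.5
# and in a context costing 0.5, comes out at Q21.
q = recalibrate(25, -2.0, -1.0, -0.5, -0.5)
```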

Miscellaneous information

  • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases per read group from a next-generation DNA sequencer; 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation is so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, so an accurate error model (essential for downstream analysis) can be ascertained.
  • The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting: the correction has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.
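
In Python, the corrected estimator looks like this (the example numbers are just illustrative):

```python
import math

def empirical_q(mismatches, bases):
    """Empirical quality with the add-one/add-two ("Yates") correction
    described above: infer the error rate as (# mismatches + 1) / (# bases + 2)
    rather than the raw ratio, so sparse bins are not overconfident."""
    return -10.0 * math.log10((mismatches + 1) / (bases + 2))

# A bin with 0 mismatches in 10 bases is NOT inferred to be error-free:
sparse = empirical_q(0, 10)      # about Q10.8, not infinity

# whereas on billions of bases the correction is negligible:
dense = empirical_q(222475, 2642683174)   # about Q40.7
```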

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

The output of the BaseRecalibrator

  • A Recalibration report containing all the recalibration information for the data

Note that the BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This table contains all the arguments used to run BQSRv2 for this dataset. It is important because the on-the-fly recalibration step must use the same parameters as the recalibration step (context sizes, covariates, ...).

Example Arguments table:

 
#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...
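
If you need to read these tables programmatically without the full GATKReport machinery, a naive whitespace-based parser (which ignores the format codes in the #:GATKTable header lines) might look like:

```python
def parse_gatk_table(lines):
    """Minimal sketch of reading one whitespace-separated GATKReport table:
    skip the '#:GATKTable' header lines, treat the first remaining line as
    column names, and each following non-empty line as a row. The real
    format encodes per-column types in the header; this sketch ignores
    them and returns every field as a string."""
    header, rows = None, []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#:GATKTable"):
            continue
        fields = line.split()
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, fields)))
    return header, rows
```

Note this breaks on column values containing spaces; it is only meant for quick inspection of tables like the ones shown in this article.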

Quantization Table

The GATK offers native support for quantizing base qualities. The GATK quantization procedure uses a statistical approach to determine the binning that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this with the engine argument -qq: with -qq 0 qualities are not quantized, and with -qq N the quantization bins are recalculated on the fly using N bins. Note that quantization is completely experimental at this point and we do not recommend using it unless you are a super advanced user.
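
As a toy stand-in for the idea (the real procedure minimizes the count-weighted error of merging qualities; this sketch ignores the counts and just snaps each observed quality to the nearest of N evenly spaced representatives):

```python
def quantize(qual_counts, n_bins):
    """Hypothetical, simplified quantizer: map each observed quality to the
    nearest of n_bins representative values spread over the observed range.
    `qual_counts` maps quality -> base count; a real quantizer would weight
    the error by these counts, which this sketch does not."""
    quals = sorted(qual_counts)
    if n_bins >= len(quals):
        return {q: q for q in quals}      # nothing to merge
    lo, hi = quals[0], quals[-1]
    if n_bins > 1:
        step = (hi - lo) / (n_bins - 1)
        reps = [lo + i * step for i in range(n_bins)]
    else:
        reps = [(lo + hi) / 2.0]
    return {q: min(reps, key=lambda r: abs(r - q)) for q in quals}
```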

Example Quantization table:

 
#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions. This is not different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762
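
The EmpiricalQuality column can be reproduced from Observations and Errors with the smoothed Phred formula described earlier. For example, for the first row:

```python
import math

def empirical_quality(errors, observations):
    # Phred-scaled error rate with the add-one/add-two smoothing used by BQSR.
    return -10.0 * math.log10((errors + 1) / (observations + 2))

# First row above: read group SRR032768, deletions.
q = empirical_quality(222475, 2642683174)   # close to 40.7476, matching EmpiricalQuality
```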

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions. This is not different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

 
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...
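
As an illustration of the context covariate (a hypothetical helper, not GATK code), each base is stratified by the fixed-size context ending at it; the 6-mer CovariateValue entries in the table above arise like this:

```python
def context_covariate(read_bases, i, size=6):
    """Hypothetical helper: the fixed-size sequence context for the base at
    position i is the `size`-mer ending at (and including) that base.
    Bases too close to the read start have no full context; GATK handles
    those specially, here we simply return None."""
    if i + 1 < size:
        return None
    return read_bases[i + 1 - size : i + 1]

# The base at position 6 of this read is stratified by the 6-mer TACGGA:
assert context_covariate("TTACGGAC", 6) == "TACGGA"
assert context_covariate("AC", 1) is None
```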

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and applying the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately, without this information the data becomes almost completely unusable, since the quality of the bases will be inferred to be much, much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
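
Under the assumption of GATK 3.x tool names and the flags shown elsewhere in this article (HaplotypeCaller, BaseRecalibrator -knownSites, PrintReads -BQSR), the bootstrapping loop might be scripted as follows. This only builds the command lines (a dry run); all file names are placeholders, and the hard-filtering of each call set is left out:

```python
# Sketch of the bootstrapping loop as GATK 3.x command lines. Adjust file
# names, insert your hard-filtering step after each calling round, and
# repeat until the recalibration tables converge.
def bootstrap_commands(ref, bam, rounds=2):
    cmds, current_bam = [], bam
    for i in range(rounds):
        vcf = "bootstrap_%d.vcf" % i          # hard-filter this before use
        grp = "recal_%d.grp" % i
        recal_bam = "recal_%d.bam" % i
        cmds.append(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "HaplotypeCaller",
                     "-R", ref, "-I", current_bam, "-o", vcf])
        cmds.append(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "BaseRecalibrator",
                     "-R", ref, "-I", current_bam, "-knownSites", vcf, "-o", grp])
        cmds.append(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "PrintReads",
                     "-R", ref, "-I", current_bam, "-BQSR", grp, "-o", recal_bam])
        current_bam = recal_bam               # next round calls on recalibrated reads
    return cmds
```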

Downsampling to reduce run time

For users concerned about run time, please note the small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy, please disregard this plot and continue to use all available data when generating the recalibration table.



Created 2016-03-07 12:18:15 | Updated 2016-03-07 12:20:27 | Tags: haplotypecaller baserecalibration joint-discovery cohort-analysis non-model-organism


Hi,

In case we do not have a database of known sites for a non-model organism, what is the best strategy to do the base recalibration and variant calling with joint genotyping when we have several samples?

According to the GATK Best Practices, when we do not have a database of known sites, it is recommended to run HaplotypeCaller and hard-filter the SNPs, then use the resulting SNPs for the base recalibration until convergence. As I have many samples (31), I have to do this on all the samples. However, as I want to do the final SNP calling with joint genotyping after the base recalibration, I think this approach is going to be really time consuming...

So, I'm wondering whether I can execute an initial SNP calling (with HaplotypeCaller) on each sample, then joint-genotype them to obtain a global SNP dataset that I will manually filter. This joint and filtered SNP database could then be used for the base recalibration and a second joint genotyping. I think that it will be faster this way. However, I'm not sure whether there are caveats in this approach and I would like to have your advice.

Thank you very much in advance.

Regards, Noppol.


Created 2016-02-22 22:56:49 | Updated | Tags: baserecalibrator baserecalibration


Hi GATK team,

I'm running into a new problem. I'm running BaseRecalibrator (GATK v3.5-0-g36282e4) and it's just stopping mid-run while it is creating the recal tables. For instance, it will run for an hour or so and just stop when it should have 7m left. Then, in my script, it goes to the next command, which is "PrintReads" and throws the error: Bad input: The GATK report has no version specified in the header

I think this means that the recal.table file isn't written properly and is incomplete. Could that be the case?

Note that the first command just stops on the line that begins with "INFO 17:10:13,787"

INFO 16:32:37,571 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:32:37,576 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 16:32:37,576 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:32:37,576 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:32:37,580 HelpFormatter - Program Args: -nct 8 -R /project/mplattlab/data/genomes/rhemac2/rhemac2.fa -knownSites /project/mplattlab/data/variant_calling/rhemac2/intermediatebams/cayo.UG1_mpileup.vcf.gz -rf BadCigar -I /project/mplattlab/data/variant_calling/rhemac2/intermediatebams/3F4.realigned.bam -T BaseRecalibrator -o /project/mplattlab/data/variant_calling/rhemac2/intermediatebams/3F4.recal.table
INFO 16:32:37,583 HelpFormatter - Executing as snydern@node083.hpc.local on Linux 2.6.32-504.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17.
INFO 16:32:37,584 HelpFormatter - Date/Time: 2016/02/22 16:32:37
INFO 16:32:37,584 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:32:37,584 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:32:38,048 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:32:38,120 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 16:32:38,126 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:32:38,169 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
WARN 16:32:38,338 IndexDictionaryUtils - Track knownSites doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 16:32:38,345 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 32 processors available on this machine
INFO 16:32:38,413 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 16:32:38,417 GenomeAnalysisEngine - Done preparing for traversal
INFO 16:32:38,417 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:32:38,418 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 16:32:38,418 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime
INFO 16:32:38,441 BaseRecalibrator - The covariates being used here:
INFO 16:32:38,441 BaseRecalibrator - ReadGroupCovariate
INFO 16:32:38,441 BaseRecalibrator - QualityScoreCovariate
INFO 16:32:38,441 BaseRecalibrator - ContextCovariate
INFO 16:32:38,441 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 16:32:38,442 BaseRecalibrator - CycleCovariate
INFO 16:32:38,445 ReadShardBalancer$1 - Loading BAM index data
INFO 16:32:38,447 ReadShardBalancer$1 - Done loading BAM index data
INFO 16:33:08,424 ProgressMeter - chr1:17156225 200002.0 30.0 s 2.5 m 0.6% 83.5 m 83.0 m
INFO 16:33:38,425 ProgressMeter - chr1:45361623 700007.0 60.0 s 85.0 s 1.6% 63.1 m 62.1 m
INFO 16:34:08,752 ProgressMeter - chr1:72017432 1200012.0 90.0 s 75.0 s 2.5% 59.7 m 58.2 m
INFO 16:34:39,063 ProgressMeter - chr1:104984857 1600017.0 120.0 s 75.0 s 3.7% 54.6 m 52.6 m
INFO 16:35:09,079 ProgressMeter - chr1:134337834 2000023.0 2.5 m 75.0 s 4.7% 53.3 m 50.8 m
INFO 16:35:39,082 ProgressMeter - chr1:164978926 2500028.0 3.0 m 72.0 s 5.8% 52.1 m 49.1 m
INFO 16:36:09,083 ProgressMeter - chr1:194735456 2900032.0 3.5 m 72.0 s 6.8% 51.5 m 48.0 m
INFO 16:36:39,085 ProgressMeter - chr1:221199137 3300036.0 4.0 m 72.0 s 7.7% 51.8 m 47.8 m
INFO 16:37:09,088 ProgressMeter - chr2:25443780 3686028.0 4.5 m 73.0 s 8.9% 50.8 m 46.3 m
INFO 16:37:39,190 ProgressMeter - chr2:61539361 4186036.0 5.0 m 71.0 s 10.1% 49.4 m 44.4 m
INFO 16:38:09,191 ProgressMeter - chr2:97847826 4786042.0 5.5 m 69.0 s 11.4% 48.3 m 42.8 m
INFO 16:38:39,227 ProgressMeter - chr2:133160916 5286047.0 6.0 m 68.0 s 12.6% 47.5 m 41.5 m
INFO 16:39:09,228 ProgressMeter - chr2:169269389 5786053.0 6.5 m 67.0 s 13.9% 46.8 m 40.3 m
INFO 16:39:39,242 ProgressMeter - chr3:11454899 6249081.0 7.0 m 67.0 s 15.0% 46.7 m 39.7 m
INFO 16:40:09,290 ProgressMeter - chr3:40227299 6749148.0 7.5 m 66.0 s 16.0% 46.9 m 39.4 m
INFO 16:40:39,294 ProgressMeter - chr3:78319439 7249156.0 8.0 m 66.0 s 17.3% 46.2 m 38.2 m
INFO 16:41:09,376 ProgressMeter - chr3:112562912 7749161.0 8.5 m 65.0 s 18.5% 45.9 m 37.4 m
INFO 16:41:40,903 ProgressMeter - chr3:155256070 8249169.0 9.0 m 65.0 s 20.0% 45.1 m 36.1 m
INFO 16:42:10,904 ProgressMeter - chr3:190721485 8749175.0 9.5 m 65.0 s 21.3% 44.9 m 35.3 m
INFO 16:42:40,906 ProgressMeter - chr4:24882535 9246702.0 10.0 m 65.0 s 22.3% 44.9 m 34.9 m
INFO 16:43:10,907 ProgressMeter - chr4:49850788 9746728.0 10.5 m 64.0 s 23.2% 45.4 m 34.9 m
INFO 16:43:40,918 ProgressMeter - chr4:88886306 1.0146733E7 11.0 m 65.0 s 24.6% 44.9 m 33.9 m
INFO 16:44:10,922 ProgressMeter - chr4:126305800 1.064674E7 11.5 m 65.0 s 25.9% 44.6 m 33.1 m
INFO 16:44:40,924 ProgressMeter - chr4:162848331 1.1146746E7 12.0 m 64.0 s 27.1% 44.3 m 32.3 m
INFO 16:45:10,928 ProgressMeter - chr5:23495201 1.1614781E7 12.5 m 64.0 s 28.1% 44.6 m 32.0 m
INFO 16:45:40,929 ProgressMeter - chr5:54955190 1.2014785E7 13.0 m 65.0 s 29.2% 44.6 m 31.6 m
INFO 16:46:10,933 ProgressMeter - chr5:91794349 1.2514805E7 13.5 m 64.0 s 30.5% 44.4 m 30.8 m
INFO 16:46:40,934 ProgressMeter - chr5:128678439 1.2914809E7 14.0 m 65.0 s 31.8% 44.1 m 30.1 m
INFO 16:47:10,935 ProgressMeter - chr5:160103569 1.3314813E7 14.5 m 65.0 s 32.9% 44.2 m 29.6 m
INFO 16:47:41,764 ProgressMeter - chr6:11636944 1.378561E7 15.1 m 65.0 s 34.1% 44.2 m 29.1 m
INFO 16:48:11,767 ProgressMeter - chr6:43059322 1.4285617E7 15.6 m 65.0 s 35.2% 44.2 m 28.7 m
INFO 16:48:41,768 ProgressMeter - chr6:69367623 1.4785631E7 16.1 m 65.0 s 36.1% 44.5 m 28.4 m
INFO 16:49:11,837 ProgressMeter - chr6:106402853 1.5285636E7 16.6 m 64.0 s 37.4% 44.3 m 27.7 m
INFO 16:49:41,838 ProgressMeter - chr6:143036256 1.5785641E7 17.1 m 64.0 s 38.7% 44.1 m 27.1 m
INFO 16:50:11,842 ProgressMeter - chr6:174124643 1.6285647E7 17.6 m 64.0 s 39.7% 44.2 m 26.6 m
INFO 16:50:41,843 ProgressMeter - chr7:29322048 1.6750392E7 18.1 m 64.0 s 40.9% 44.1 m 26.1 m
INFO 16:51:11,845 ProgressMeter - chr7:59229721 1.7250397E7 18.6 m 64.0 s 42.0% 44.2 m 25.7 m
INFO 16:51:41,847 ProgressMeter - chr7:85301617 1.7850419E7 19.1 m 64.0 s 42.9% 44.4 m 25.4 m
INFO 16:52:11,852 ProgressMeter - chr7:127170968 1.8350424E7 19.6 m 63.0 s 44.3% 44.1 m 24.6 m
INFO 16:52:41,853 ProgressMeter - chr7:161516598 1.8850429E7 20.1 m 63.0 s 45.5% 44.0 m 24.0 m
INFO 16:53:11,856 ProgressMeter - chr8:10619904 1.9318596E7 20.6 m 63.0 s 46.2% 44.5 m 23.9 m
INFO 16:53:41,863 ProgressMeter - chr8:42355688 1.9818601E7 21.1 m 63.0 s 47.3% 44.5 m 23.5 m
INFO 16:54:11,865 ProgressMeter - chr8:78323666 2.0418609E7 21.6 m 63.0 s 48.5% 44.4 m 22.8 m
INFO 16:54:41,867 ProgressMeter - chr8:116054511 2.0818615E7 22.1 m 63.0 s 49.9% 44.2 m 22.2 m
INFO 16:55:11,868 ProgressMeter - chr8:147792144 2.1418623E7 22.6 m 63.0 s 51.0% 44.2 m 21.7 m
INFO 16:55:41,872 ProgressMeter - chr9:20426750 2.1868507E7 23.1 m 63.0 s 51.7% 44.6 m 21.5 m
INFO 16:56:11,877 ProgressMeter - chr9:53707301 2.2368512E7 23.6 m 63.0 s 52.8% 44.6 m 21.0 m
INFO 16:56:41,882 ProgressMeter - chr9:89871581 2.2868517E7 24.1 m 63.0 s 54.1% 44.4 m 20.4 m
INFO 16:57:12,054 ProgressMeter - chr9:122731316 2.3468524E7 24.6 m 62.0 s 55.3% 44.4 m 19.9 m
INFO 16:57:42,062 ProgressMeter - chr10:17454779 2.3986596E7 25.1 m 62.0 s 56.2% 44.5 m 19.5 m
INFO 16:58:12,067 ProgressMeter - chr10:50223841 2.4486601E7 25.6 m 62.0 s 57.4% 44.5 m 19.0 m
INFO 16:58:42,073 ProgressMeter - chr10:77360320 2.4986606E7 26.1 m 62.0 s 58.3% 44.7 m 18.6 m
INFO 16:59:12,177 ProgressMeter - chr11:8993476 2.5518831E7 26.6 m 62.0 s 59.3% 44.8 m 18.3 m
INFO 16:59:42,182 ProgressMeter - chr11:38620538 2.5918836E7 27.1 m 62.0 s 60.3% 44.9 m 17.8 m
INFO 17:00:12,184 ProgressMeter - chr11:74497791 2.6418841E7 27.6 m 62.0 s 61.5% 44.8 m 17.2 m
INFO 17:00:42,282 ProgressMeter - chr11:114588057 2.7018847E7 28.1 m 62.0 s 62.9% 44.6 m 16.5 m
INFO 17:01:13,608 ProgressMeter - chr12:2663171 2.7545983E7 28.6 m 62.0 s 63.7% 44.9 m 16.3 m
INFO 17:01:43,668 ProgressMeter - chr12:49229431 2.8045989E7 29.1 m 62.0 s 65.4% 44.5 m 15.4 m
INFO 17:02:13,672 ProgressMeter - chr12:88250627 2.8545994E7 29.6 m 62.0 s 66.7% 44.3 m 14.8 m
INFO 17:02:43,673 ProgressMeter - chr13:10031645 2.9040692E7 30.1 m 62.0 s 67.7% 44.4 m 14.3 m
INFO 17:03:13,675 ProgressMeter - chr13:45017985 2.9540698E7 30.6 m 62.0 s 68.9% 44.4 m 13.8 m
INFO 17:03:43,676 ProgressMeter - chr13:80475888 3.0040704E7 31.1 m 62.0 s 70.2% 44.3 m 13.2 m
INFO 17:04:13,681 ProgressMeter - chr13:113180238 3.054071E7 31.6 m 62.0 s 71.3% 44.3 m 12.7 m
INFO 17:04:43,757 ProgressMeter - chr13:138024042 3.1040715E7 32.1 m 62.0 s 72.2% 44.5 m 12.4 m
INFO 17:05:13,758 ProgressMeter - chr14:38332097 3.1482365E7 32.6 m 62.0 s 73.5% 44.3 m 11.7 m
INFO 17:05:43,762 ProgressMeter - chr14:67024340 3.2182506E7 33.1 m 61.0 s 74.5% 44.4 m 11.3 m
INFO 17:06:13,763 ProgressMeter - chr14:96669997 3.2582511E7 33.6 m 61.0 s 75.6% 44.5 m 10.9 m
INFO 17:06:43,764 ProgressMeter - chr14:133001237 3.3182517E7 34.1 m 61.0 s 76.8% 44.4 m 10.3 m
INFO 17:07:13,766 ProgressMeter - chr15:27355549 3.3668498E7 34.6 m 61.0 s 77.8% 44.5 m 9.9 m
INFO 17:07:43,772 ProgressMeter - chr15:62906952 3.4168503E7 35.1 m 61.0 s 79.0% 44.4 m 9.3 m
INFO 17:08:13,773 ProgressMeter - chr15:98131719 3.466851E7 35.6 m 61.0 s 80.2% 44.3 m 8.8 m
INFO 17:08:43,777 ProgressMeter - chr16:18848885 3.5157972E7 36.1 m 61.0 s 81.3% 44.4 m 8.3 m
INFO 17:09:13,783 ProgressMeter - chr16:55389626 3.575798E7 36.6 m 61.0 s 82.6% 44.3 m 7.7 m
INFO 17:09:43,785 ProgressMeter - chr17:6084995 3.6278968E7 37.1 m 61.0 s 83.6% 44.3 m 7.3 m
INFO 17:10:13,787 ProgressMeter - chr17:40912603 3.6779029E7 37.6 m 61.0 s 84.8% 44.3 m 6.7 m
INFO 17:10:36,122 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:10:36,124 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 17:10:36,124 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:10:36,124 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:10:36,127 HelpFormatter - Program Args: -nct 8 -R /project/mplattlab/data/genomes/rhemac2/rhemac2.fa -knownSites /project/mplattlab/data/variant_calling/rhemac2/intermediatebams/cayo.UG1_mpileup.vcf.gz -rf BadCigar -I /project/mplattlab/data/variant_calling/rhemac2/intermediatebams/3F4.realigned.bam -T BaseRecalibrator -BQSR /project/mplattlab/data/variant_calling/rhemac2/intermediatebams/3F4.recal.table -o /project/mplattlab/data/variant_calling/rhemac2/intermediatebams/3F4.post_recal.table
INFO 17:10:36,131 HelpFormatter - Executing as snydern@node083.hpc.local on Linux 2.6.32-504.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17.
INFO 17:10:36,131 HelpFormatter - Date/Time: 2016/02/22 17:10:36
INFO 17:10:36,132 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:10:36,132 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:10:36,591 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:10:37,116 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Bad input: The GATK report has no version specified in the header
ERROR ------------------------------------------------------------------------------------------

Note that it doesn't do this for all of my samples, but it does for the majority of them. Some of them finish the recalibration without a problem, printing out:

INFO 17:39:20,763 BaseRecalibrator - Calculating quantized quality scores...
INFO 17:39:20,800 BaseRecalibrator - Writing recalibration report...
INFO 17:39:21,359 BaseRecalibrator - ...done!
INFO 17:39:21,359 BaseRecalibrator - BaseRecalibrator was able to recalibrate 16753906 reads
INFO 17:39:21,360 ProgressMeter - done 1.675392E7 32.3 m 115.0 s 100.0% 32.3 m 0.0 s
INFO 17:39:21,360 ProgressMeter - Total runtime 1938.57 secs, 32.31 min, 0.54 hours
INFO 17:39:21,360 MicroScheduler - 1625435 reads were filtered out during the traversal out of approximately 18379355 total reads (8.84%)
INFO 17:39:21,361 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 17:39:21,361 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 17:39:21,361 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 17:39:21,361 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 17:39:21,361 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 17:39:21,361 MicroScheduler - -> 1570404 reads (8.54% of total) failing MappingQualityZeroFilter
INFO 17:39:21,362 MicroScheduler - -> 55031 reads (0.30% of total) failing NotPrimaryAlignmentFilter
INFO 17:39:21,362 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 17:39:22,269 GATKRunReport - Uploaded run statistics report to AWS S3
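For errors like "The GATK report has no version specified in the header" above, one quick sanity check (a generic sketch, not a GATK tool) is to verify that the table passed via -BQSR actually begins with the #:GATKReport version line; an empty or truncated table from an interrupted first pass fails in exactly this way:

```python
import os
import tempfile

# Hedged sketch: verify that a recalibration table passed via -BQSR starts
# with the "#:GATKReport.v..." version line. A truncated or empty table
# (e.g. from a first BaseRecalibrator pass that died mid-write) triggers
# the "no version specified in the header" error.
def has_gatkreport_header(path):
    with open(path) as fh:
        return fh.readline().startswith("#:GATKReport.v")

# Demo with a stand-in file (the real path would be e.g. 3F4.recal.table).
tmp = tempfile.NamedTemporaryFile("w", suffix=".table", delete=False)
tmp.write("#:GATKReport.v1.1:5\n#:GATKTable:...\n")
tmp.close()
print(has_gatkreport_header(tmp.name))  # a well-formed table passes
os.remove(tmp.name)
```

Samples whose first-pass table fails this check are good candidates for simply re-running the first BaseRecalibrator pass.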


Created 2016-02-19 19:00:12 | Updated | Tags: baserecalibrator baserecalibration basequalityscorerecalibration

Comments (6)

Hello,

I am trying to perform base recalibration on a non-human genome. I've downloaded the dbSNP vcf file from Ensembl (I also downloaded another version from NCBI), and in both files the chromosome naming system is "1, 2, 3...". On the other hand, the NCBI reference file I used names chromosomes differently: chr1 = gi|966749131|ref|NC_006088.4|, chr2 = gi|966749130|ref|NC_006089.4|...

However, BaseRecalibrator seemed to run fine. I am wondering whether GATK is capable of internally recognizing the NCBI chromosome names, or whether the algorithm generated a completely wrong recal table and recalibrated all the scores incorrectly. I've also attached the output of my AnalyzeCovariates plot in case that helps. Thanks!
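GATK matches contigs by name, so a mismatch like the one described is worth checking explicitly before trusting the recal table. A minimal sketch (contig names copied from the question; this is not a GATK feature, just a comparison of the two name sets):

```python
# Hedged sketch: if the reference dictionary and the known-sites VCF share
# no contig names, the known polymorphic sites are likely never matched to
# any position, even if the run itself appears to succeed. The two sets
# below are illustrative, taken from the question.
ref_contigs = {"gi|966749131|ref|NC_006088.4|", "gi|966749130|ref|NC_006089.4|"}
vcf_contigs = {"1", "2"}

shared = ref_contigs & vcf_contigs
print("contigs shared by reference and VCF:", len(shared))
if not shared:
    print("WARNING: no contig names in common -- known sites are likely ignored")
```

In practice the two name sets would be read from the reference .dict/.fai and the VCF header rather than typed in.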


Created 2014-06-12 06:47:19 | Updated | Tags: baserecalibrator baserecalibration

Comments (2)

Hi

I wanted to know what to expect if one runs GATK base recalibration without the dbSNP file. From what I understand from the documentation, all sites will be considered for recalibration and no known polymorphic sites will be skipped. This might lower the quality scores at known sites too, so one should see a decrease in the number of variants called - am I right? I did this by mistake for some of my samples and see an increase in the number of somatic variants called (both tumor and normal were recalibrated without the dbSNP file). Would these extra variants be reliable? Any explanations for this?

Thanks
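The intuition behind the question can be made concrete with a toy calculation (illustrative numbers, and a deliberately simplified version of the empirical-quality estimate BQSR-style tools use): without the dbSNP mask, real variant bases are counted as sequencing errors, so the empirical quality drops and the recalibrated scores come out lower.

```python
import math

# Hedged sketch with made-up counts: empirical quality is estimated from
# the mismatch rate against the reference. Without a known-sites mask,
# true SNP bases are added to the "error" count, deflating the estimate.
def empirical_q(errors, observations):
    # simplified estimator with a small pseudocount prior
    return -10 * math.log10((errors + 1) / (observations + 2))

obs = 1_000_000
machine_errors = 1000   # true sequencing errors (~Q30 machine)
real_snps = 1000        # true variant bases, wrongly counted without a mask

print(round(empirical_q(machine_errors, obs), 1))              # with dbSNP mask
print(round(empirical_q(machine_errors + real_snps, obs), 1))  # without mask
```

In this toy example the estimate falls from about Q30 to about Q27 when the mask is omitted, which is the direction of the effect described in the documentation.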


Created 2013-05-09 06:14:39 | Updated | Tags: baserecalibration

Comments (2)

In GATK 1.x there was an option to discard the original quality scores for a recalibrated BAM, the switch --doNotWriteOriginalQuals, I think during TableRecalibration.

Is there an equivalent in GATK v2.x? As far as I can see there is no option in PrintReads after BaseRecalibrator to discard the original quality scores in favour of the recalibrated ones. My intention is to keep the file sizes down as much as possible. Thanks
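Independently of any GATK flag, one way to reclaim the space is to strip the OQ (original quality) tag from reads afterwards. A hedged sketch operating on text SAM (e.g. piped through samtools view -h), with a made-up example read:

```python
# Hedged sketch: drop the OQ:Z: optional tag from SAM-format lines as a
# post-processing step to shrink the file. Header lines pass through
# untouched; the first 11 mandatory fields of each read are kept as-is.
def strip_oq(sam_line):
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.rstrip("\n").split("\t")
    kept = fields[:11] + [f for f in fields[11:] if not f.startswith("OQ:Z:")]
    return "\t".join(kept)

# Hypothetical example read carrying an OQ tag.
read = "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tOQ:Z:IIIII\tNM:i:0"
print(strip_oq(read))
```

The same idea is usually done with dedicated BAM tooling in practice; the sketch just shows which field is the space cost.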


Created 2013-04-23 07:13:44 | Updated 2013-04-23 07:16:35 | Tags: printreadswalker baserecalibration

Comments (2)

Hi

I am using GATK on resequencing data from the cattle genome.

Right now I am running BaseRecalibrator, and after recalibration I have my GATK report file.

To produce the recalibrated BAM file, my command line is:

Sample]$ java -Xmx30g -jar GenomeAnalysisTK.jar -T PrintReads -I Sample_realigned.bam -R reference/umd3.1_genome.fa -BQSR Sample_rcal.grp -o Sample_rcal.bam -S LENIENT

But I get a problem after processing 11.8% of the data. The error message follows:

INFO 20:11:01,971 ProgressMeter - Chr3:12544311 9.15e+07 12.1 h 7.9 m 11.5% 4.4 d 92.8 h
INFO 20:12:08,586 ProgressMeter - Chr3:14795199 9.24e+07 12.1 h 7.9 m 11.6% 4.3 d 92.2 h
INFO 20:13:12,670 ProgressMeter - Chr3:18812278 9.32e+07 12.1 h 7.8 m 11.8% 4.3 d 91.0 h
INFO 20:31:46,929 ProgressMeter - Chr3:19688330 9.34e+07 12.4 h 8.0 m 11.8% 4.4 d 93.0 h
INFO 20:31:48,283 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: There is no space left on the device, so writing failed

However, when I tried the same command restricted to an interval, it worked; I got my output BAM file with the following command:

Sample]$ java -Xmx30g -jar GenomeAnalysisTK.jar -T PrintReads -I Sample_realigned.bam -L Chr25 -R reference/umd3.1_genome.fa -BQSR Sample_rcal.grp -o Sample_rcal.bam -S LENIENT

Please, could you help me?

Thanks

Ashutosh
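For failures like the "no space left on the device" error above, a quick pre-flight check of free space on the output filesystem can save a half-day run. A generic stdlib sketch (not a GATK feature; the size estimate is an assumption - a recalibrated BAM is typically at least as large as its input):

```python
import shutil

# Hedged sketch: compare free bytes on the output filesystem against a
# rough estimate of the output size. Paths and thresholds are illustrative;
# in practice needed_bytes could be os.path.getsize() of the input BAM.
def enough_space(out_dir, needed_bytes):
    return shutil.disk_usage(out_dir).free >= needed_bytes

# Demo: check the current directory against a trivial 1-byte requirement.
print(enough_space(".", 1))
```

Running per-chromosome with -L, as in the working command above, also keeps each intermediate output smaller, which is consistent with the interval run succeeding.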


Created 2012-11-28 11:47:29 | Updated | Tags: indelrealigner picard microscheduler baserecalibration createtarget

Comments (2)

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired-end reads; my protocol so far is (in brief, omitting options etc.):

  • bwa aln R1.fastq
  • bwa aln R2.fastq
  • bwa sampe R1.sai R2.sai
  • picard/CleanSam.jar
  • picard/SortSam.jar
  • picard/MarkDuplicates.jar
  • picard/AddOrReplaceReadGroups.jar
  • picard/BuildBamIndex.jar
  • GATK -T RealignerTargetCreator -known dbsnp.vcf
  • GATK -T IndelRealigner -known dbsnp.vcf
  • GATK -T BaseRecalibrator -knownSites dbsnp.vcf
  • GATK -T PrintReads

A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 read pairs = 171534452 sequences in the fastQ files.

BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched.
Marking 20272927 records as duplicates.
Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to be missing.

CreateTargets MicroScheduler reports

35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to be missing.

The Realigner MicroScheduler reports

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which seems like a miracle to me, since (1) there are now even more reads than input sequences, and (2) none of the problematic reads reported by CreateTargets appear.

From Base recalibration MicroScheduler, I get

41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

... so my reads got even more offspring, but the duplicate reads, for example, reappear in roughly the same numbers.

I find these varying counts a little confusing -- can someone please explain the logic behind these numbers? And does the protocol look sensible?

Thanks for any comments!
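As a sanity check on the walker logs quoted above, the MicroScheduler percentages can be reproduced directly from the raw counts (counts copied from the CreateTargets report); each walker performs its own traversal, so totals can legitimately differ between steps even when the per-filter arithmetic is internally consistent:

```python
# Hedged sketch: reproduce the CreateTargets MicroScheduler percentages
# from the counts in the post above.
total = 166_579_875
filters = {
    "BadMateFilter": 428_072,
    "DuplicateReadFilter": 16_077_607,
    "MappingQualityZeroFilter": 19_409_876,
}

filtered = sum(filters.values())
print(f"{filtered} / {total} = {100 * filtered / total:.2f}%")
for name, n in filters.items():
    print(f"  {name}: {100 * n / total:.2f}%")
```

The per-filter sum does equal the reported 35915555 filtered reads, so within each walker the accounting adds up; the puzzle is only the differing traversal totals.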


Created 2012-11-12 22:55:36 | Updated 2013-01-07 20:06:43 | Tags: indelrealigner bqsr best-practices baserecalibration

Comments (1)

Hello,

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems that in my initial runs the indel-realignment and quality-recalibration steps were reversed. For example, in order of processing, I ran:

  • MarkDuplicates
  • Count Covariates
  • Table Recalibration
  • Realigner Target Creator
  • Indel Realigner

What are the consequences for SNP and INDEL calling if the GATK steps are run as above? I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I want to know whether it is essential to re-analyze the samples. The variant caller I'm using looks at base quality scores when making calls.

Any help on this would be appreciated.

Mike


Created 2012-08-29 10:02:02 | Updated 2013-01-07 20:42:47 | Tags: best-practices readgroup realignment baserecalibration

Comments (3)

Hello,

I am new to using GATK (v 2.1-3). I do exome sequencing per sample using the following steps:

  • Alignment with BWA (0.6.2)
  • GATK: local realignment around INDELs
  • Picard (1.67): FixMateInformation
  • GATK: recalibration (BaseRecalibrator + PrintReads -BQSR)
  • Samtools for calling variants

Samtools seems to run properly, but no files (.vcf or .bcf) are created and no error message appears:

cd Sample_09

  • samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50
  • bcftools view -bvcg -

    [mpileup] 1 samples in 1 input files
    Set max per-file depth to 8000
    [bcfview] 100000 sites processed. [afs] 0:89274.054 1:6411.053 2:4314.893
    [bcfview] 200000 sites processed. [afs] 0:89100.642 1:6125.883 2:4773.474
    [bcfview] 300000 sites processed. [afs] 0:87374.996 1:7439.238 2:5185.766
    [bcfview] 400000 sites processed. [afs] 0:87890.186 1:7087.628 2:5022.185
    [bcfview] 500000 sites processed. [afs] 0:85322.061 1:8454.843 2:6223.096
    [bcfview] 600000 sites processed. [afs] 0:85864.368 1:8410.777 2:5724.854
    [bcfview] 700000 sites processed. [afs] 0:88813.814 1:6828.001 2:4358.185
    [bcfview] 800000 sites processed. [afs] 0:89070.318 1:6302.924 2:4626.758
    [bcfview] 900000 sites processed. [afs] 0:88364.380 1:6796.962 2:4838.658
    [bcfview] 1000000 sites processed. [afs] 0:86892.531 1:7268.088 2:5839.381
    [bcfview] 1100000 sites processed. [afs] 0:87184.845 1:7153.073 2:5662.081
    [bcfview] 1200000 sites processed. [afs] 0:86762.756 1:7241.236 2:5996.008
    [bcfview] 1300000 sites processed. [afs] 0:89346.143 1:6159.989 2:4493.868
    [bcfview] 1400000 sites processed. [afs] 0:88658.355 1:7160.555 2:4181.089
    [bcfview] 1500000 sites processed. [afs] 0:85985.305 1:8308.039 2:5706.656
    [bcfview] 1600000 sites processed. [afs] 0:87346.636 1:7708.883 2:4944.480
    [afs] 0:63097.202 1:3950.127 2:3572.670
  • bcftools view .bcf
  • cd ..
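A likely explanation (my assumption, based on the commands as posted): old-style bcftools view -bvcg - writes the BCF to standard output, so without a shell redirection such as "... | bcftools view -bvcg - > var.raw.bcf" the pipeline runs to completion yet no file ever lands on disk. A stdlib sketch for checking that the expected output actually exists (file names are hypothetical):

```python
import os
import tempfile

# Hedged sketch: after the (redirected) mpileup | bcftools pipe finishes,
# confirm the expected output file exists and is non-empty on disk.
def output_written(path):
    return os.path.isfile(path) and os.path.getsize(path) > 0

# Stand-in for the pipeline's redirected output file.
tmp = tempfile.NamedTemporaryFile("w", suffix=".bcf", delete=False)
tmp.write("stand-in BCF payload")
tmp.close()
print(output_written(tmp.name))  # True once the redirection is in place
os.remove(tmp.name)
```

The same check, applied right after the bcftools step, would have flagged the silent failure immediately.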

I have seen that some groups use Picard AddOrReplaceReadGroups after realignment, and I wonder if I should use it before calling variants with samtools.

Thanks in advance for any advice you can give me.

Chris