Base quality score recalibration

From GSA

Jump to: navigation, search

Tutorial slides describing how to use the software can be found here.

Detailed information about command line options for CountCovariates can be found here.

Detailed information about command line options for TableRecalibration can be found here.

Detailed information about command line options for AnalyzeCovariates can be found here.

Contents

Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See #Expected outputs for examples of pre and post corrected values.

The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.

Running the tools

Downloading the files

  • Download the base quality score recalibrator (now embedded in the Genome Analysis Toolkit, early access release) from the GSA FTP server
  • Get the GATK resource bundle, which includes common input files for humans. The appropriate dbSNP file to use with CountCovariates for human data is the most recent version of dbSNP, in VCF format, currently dbsnp_132. The dbsnp_132.excluding_sites_after_129 file should not be used for recalibration.

CountCovariates

Detailed information about command line options for CountCovariates can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • Read group the read belongs to X assigned quality score X machine cycle producing this base X current base + previous base (dinucleotide)

For each of such bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, CountCovariates produces a file called my_reads.recal_data.csv, which contains the data needed to recalibrate reads. The format of this file is described below.

TableRecalibration

Detailed information about command line options for TableRecalibration can be found here.

After counting covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recal_data.csv file, into a new BAM file.

This step uses the recalibration table data in my_reads.recal_data.csv produced by CountCovariates to recalibrate the quality scores in my_reads.bam, writing out a new BAM file my_reads.recal.bam with recalibrated QUAL field values.

Effectively the new quality score is the sum of the global difference between reported quality scores and the empirical quality, plus the quality bin specific shift, plus the cycle x qual and dinucleotide x qual effect. Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

AnalyzeCovariates.jar

Detailed information about command line options for AnalyzeCovariates can be found here.

After counting covariates in either the initial BAM File or again in the recalibrated BAM File, an analysis tool is available which reads the .csv file and outputs several PDF (and .dat) files for each read group in the given BAM. These PDF files graphically show the various metrics and characteristics of the reported quality scores (often in relation to the empirical qualities). In order to show that any biases in the reported quality scores have been generally fixed through recalibration one should run CountCovariates again on a bam file produced by TableRecalibration. In this way users can compare the analysis plots generated by pre-recalibration and post-recalibration .csv files. Our usual chain of commands that we use to generate plots of residual error is: CountCovariates, TableRecalibrate, samtools index on the recalibrated bam file, CountCovariates again on the recalibrated bam file, and then AnalyzeCovariates on both the before and after recal_data.csv files to see the improvement in recalibration.

While running AnalyzeCovariates.jar, you will see output that looks something like

Reading in input csv file...
...Done!
Writing out intermediate tables for R...
Writing out data tables for read group: ERR001298    with 2204608 observations    and aggregate residual error = 0.406
Writing out data tables for read group: ERR001299    with 2856867 observations    and aggregate residual error = 0.471
Writing out data tables for read group: SRR001118    with 501422 observations     and aggregate residual error = 0.107
Writing out data tables for read group: SRR001804    with 1856272 observations    and aggregate residual error = 0.519
Writing out data tables for read group: SRR001117    with 508177 observations     and aggregate residual error = 0.140
Writing out data tables for read group: SRR001803    with 1853374 observations    and aggregate residual error = 0.469
Writing out data tables for read group: SRR006419    with 752757 observations     and aggregate residual error = 0.269
Writing out data tables for read group: SRR001806    with 1929180 observations    and aggregate residual error = 0.425
...Done!
Calling analysis R scripts and writing out figures...
Analyzing read group: ERR001298
Analyzing read group: ERR001299
Analyzing read group: SRR001118
Analyzing read group: SRR001804
Analyzing read group: SRR001117
Analyzing read group: SRR001803
Analyzing read group: SRR006419
Analyzing read group: SRR001806
...Done!

Example output plots are available in the "Example pre and post recalibration results" section below.

System details

  • Aligned BAM-formatted reads: my_reads.bam
    • Must be aligned to the provided FASTA reference sequence and sorted in coordinate order.
    • You can use Picard or samtools to sort the file. See Input_files_for_the_GATK for more information. The GATK uses the BAM header to determine if the file is sorted, so you'll need to make sure the header correctly indicates the SO of the BAM before the GATK will process it. Picard will write a correct SO TAG after sorting a file.
  • FASTA-formatted reference file: Homo_sapiens_assembly18.fasta
    • standard FASTA formatted reference file. Also requires associated .dict and .fai files. See Input_files_for_the_GATK for details
    • The example above are for a BAM reads file aligned to one of the two standard human reference, hg18. The resources bundle also includes b36, which the 1000 genomes project uses for their reference.
  • known variants : dbSNP VCF
    • This is a sorted and filtered version of the dbSNP database dump. See the See Input_files_for_the_GATK for more information if you'd like to understand this file format.
    • The above file is for hg18. Please use the appropriate dbsnp file from the GATK resource bundle
  • The recalibration system is read-group aware. It separates the covariate data by read group in the recal .csv file (using @RG tags) and TableRecalibration will apply this data for each read group in the file. We routinely process BAM files with hundreds or thousands of read groups. Please note that the memory requirements scale linearly with the number of read groups in the file. A 5000 RG file can easily require 10GB of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real snps, you should always perform recalibration on your bam file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
  • The table recalibrator applies a "yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases we actually infer it from (# mismatches + 1) / (# bases + 1). This deals very nicely with overfitting problems, which has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.

Available covariates

The covariates that have been implemented so far include (complete set available only with full git checkout):

  • CycleCovariate: The machine cycle for this base (different definition for the various technologies and therefore platform [@PL tag] is pulled out of the read's read group).
  • DinucCovariate: The combination of this base and the previous base.
  • HomopolymerCovariate: The number of consecutive previous bases that match the current base.
  • MappingQualityCovariate: The mapping quality assigned to this read by the aligner.
  • MinimumNQSCovariate: The minimum base quality score in a small window in the read around this base.
  • PositionCovariate: The position along the length of the read. For Illumina this is the same as machine cycle but that is not the case for the other platforms.
  • PrimerRoundCovariate: The primer round for this base (only meaningful for SOLiD reads).
  • QualityScoreCovariate: The reported base quality score for this base.
  • ReadGroupCovariate: The read group this read is a member of.

Software Dependencies

The recalibrator currently depends on the following applications. Please install these before proceeding.

  • Java Runtime Environment 1.6.0_16 or later
  • If you want to build your own reference files or need to manipulate your BAM files:
    • samtools 0.1.4 or later. for preparation of .fai fasta index files
    • Picard to prepare the .dict dictionary files

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.  
If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.
All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.
I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I've split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?
Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs. ;

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by using the --process_nth_locus command line option or by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

File:Downsampling recal.png

Support

For support or discussion of feature requests please participate in the GATK community at http://getsatisfaction.com/gsa

Personal tools