This script converts a VCF file from one reference build to another. It runs 3 modules within our toolkit that are necessary for lifting over a VCF.
1. LiftoverVariants walker
2. sortByRef.pl to sort the lifted-over file
3. Filter out records whose ref field no longer matches the new reference
The liftOverVCF.pl script is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.
./liftOverVCF.pl -vcf calls.b36.vcf \ -chain b36ToHg19.broad.over.chain \ -out calls.hg19.vcf \ -gatk /humgen/gsa-scr1/ebanks/Sting_dev -newRef /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19 -oldRef /humgen/1kg/reference/human_b36_both -tmp /broad/shptmp [defaults to /tmp]
Running the script with no arguments will show the usage:
Usage: liftOverVCF.pl -vcf <input vcf> -gatk <path to gatk trunk> -chain <chain file> -newRef <path to new reference prefix; we will need newRef.dict, .fasta, and .fasta.fai> -oldRef <path to old reference prefix; we will need oldRef.fasta> -out <output vcf> -tmp <temp file location; defaults to /tmp>
Chain files from b36/hg18 to hg19 are located here within the Broad:
External users can get them off our ftp site:
location: ftp.broadinstitute.org username: gsapubftp-anonymous path: Liftover_Chain_Files
Genotype and Validate is a tool to asses the quality of a technology dataset for calling SNPs and Indels given a secondary (validation) datasource.
The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you want to know how well a particular technology performs calling these snps. With a dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's dataset.
Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.
Usage of GenotypeAndValidate and its command line arguments are described here.
The annotations can be either true positive (T) or false positive (F). 'T' means it is known to be a true SNP/Indel, while a 'F' means it is known not to be a SNP/Indel but the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.
GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:
|called alt||True Positive (TP)||False Positive (FP)||Positive PV|
|called ref||False Negative (FN)||True Negative (TN)||Negative PV|
The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnose.
The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).
You should always use -BTI alleles, so that the GATK only looks at the sites on the VCF file, speeds up the process a lot. (this will soon be added as a default gatk engine mode)
The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).
Genotypes BAM file from new technology using the VCF as a truth dataset:
java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myNewTechReads.bam \ -alleles handAnnotatedVCF.vcf \ -BTI alleles \ -o gav.vcf
An annotated VCF example (info field clipped for clarity)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1 1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./. 13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./. 1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
Using a BAM file as the truth dataset:
java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myTruthDataset.bam \ -alleles callsToValidate.vcf \ -BTI alleles \ -bt \ -o gav.vcf
Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:
Note that earlier versions of the GATK used a different tool.
For a complete, detailed argument reference, refer to the GATK document page here
This tool generates amplicon sequences for use with the Sequenom primer design tool. The output of this tool is fasta-formatted, where the characters [A/B] specify the allele to be probed (see Validation Amplicons Output further below). It can mask nearby variation (either by 'N' or by lower-casing characters), and can try to restrict sequenom design to regions of the amplicon likely to generate a highly specific primer. This tool will also flag sites with properties that could shift the mass-spec peak from its expected value, such as indels in the amplicon sequence, SNPs within 4 bases of the variant attempting to be probed, or multiple variants selected for validation falling into the same amplicon.
Ns in the amplicon sequence instructs primer design software (such as Sequenom) not to use that base in the primer: any primer will fall entirely before, or entirely after, that base. Lower-case letters instruct the design software to try to avoid using the base (presumably by applying a penalty for doing so), but will not prevent it from doing so if a good primer (i.e. a primer with suitable melting temperature and low probability of hairpin formation) is found.
ValidationAmplicons relies on the GATK Sting BWA/C bindings to assess the specificity of potential primers. The wiki page for Sting BWA/C bindings contains required information about how to download the appropriate version of BWA, how to create a BWT reference, and how to set your classpath appropriately to run this tool. If you have not followed the directions to set up the BWA/C bindings, you will not be able to create validation amplicon sequences using the GATK. There is an argument (see below) to disable the use of BWA, and lower repeats within the amplicon only. Use of this argument is not recommended.
Validation Amplicons requires three input files: a VCF of alleles you want to validate, a VCF of variants you want to mask, and a Table of intervals around the variants describing the size of the amplicons. For instance:
Alleles to Validate
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO 20 207414 . G A 85.09 PASS . // SNP to validate 20 792122 . TCCC T 22.24 PASS . // DEL to validate 20 994145 . G GAAG 48.21 PASS . // INS to validate 20 1074230 . C T 2.29 QD . // SNP to validate (but filtered) 20 1084330 . AC GT 42.21 PASS . // MNP to validate
HEADERpos name 20:207334-207494 20_207414 20:792042-792202 20_792122 20:994065-994225 20_994145 20:1074150-1074310 20_1074230 20:1084250-1084410 20_1084330
Alleles to Mask
##fileformat=VCFv4.1 #CHROM POS ID REF ALT QUAL FILTER INFO 20 207414 . G A 77.12 PASS . 20 207416 . A AGGC 49422.34 PASS . 20 792076 . A G 2637.15 HaplotypeScore . 20 792080 . T G 161.83 PASS . 20 792087 . CGGT C 179.84 ReadPosRankSum . 20 792106 . C G 32.59 PASS . 20 792140 . C G 409.75 PASS . 20 1084319 . T A,C 22.24 PASS . 20 1084348 . TACCACCCCACACA T 482.84 PASS .
The output from Validation Amplicons is a fasta-formatted file, with a small adaptation to represent the site being probed. Using the test files above, the output of the command
java -jar $GATK/dist/GenomeAnalysisTK.jar \ -T ValidationAmplicons \ -R /humgen/1kg/reference/human_g1k_v37.fasta \ -BTI ProbeIntervals \ --ProbeIntervals:table interval_table.table \ --ValidateAlleles:vcf sites_to_validate.vcf \ --MaskAlleles:vcf mask_sites.vcf \ --virtualPrimerSize 30 \ -o probes.fasta \ -l WARN
>20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414 CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC >20:792122 Valid 20_792122 TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT >20:994145 Valid 20_994145 TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC >20:1074230 SITE_IS_FILTERED=1, 20_1074230 ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC >20:1084330 DELETION=1, 20_1084330 CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC
Note that SNPs have been masked with 'N's, filtered 'mask' variants do not appear, the insertion has been flanked by Ns, the unfiltered deletion has been replaced by Ns, and the filtered site in the validation VCF is not marked as valid. In addition, bases that fall inside at least one non-unique 30-mer (meaning no multiple MQ0 alignments using BWA) are lower-cased. The identifier for each sequence is the position of the allele to be probed, a 'validation status' (defined below), and a string representing the amplicon. Validation status values are:
Valid // amplicon is valid SITE_IS_FILTERED=1 // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant") VARIANT_TOO_NEAR_PROBE=1 // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak MULTIPLE_PROBES=1, // multiple variants to be validated found inside the same amplicon DELETION=6,INSERTION=5, // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate DELETION=1, // deletion found inside the amplicon region, could shift mass-spec peak START_TOO_CLOSE, // variant is too close to the start of the amplicon region to give sequenom a good chance to find a suitable primer END_TOO_CLOSE, // variant is too close to the end of the amplicon region to give sequenom a good chance to find a suitable primer NO_VARIANTS_FOUND, // no variants found within the amplicon region INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. insertion directly preceding or postceding, or a deletion that spans the site itself)
The files provided to Validation Amplicons should be such that all generated amplicons are valid. That means:
There are no variants within 4bp of the site to be validated There are no indels in the amplicon region Amplicon windows do not include other sites to be probed Amplicon windows are not too short, and the variant therein is not within 50bp of either edge All amplicon windows contain a variant to be validated Variants to be validated are unfiltered or pass filters
The tool will warn you each time any of these conditions are not met.
ValidationSiteSelectorWalker is intended for use in experiments where we sample data randomly from a set of variants, for example in order to choose sites for a follow-up validation study. Sites are selected randomly but within certain restrictions. There are two main sources of restrictions: Sample restrictions and Frequency restrictions. Sample restrictions alter the polymorphic/monomorphic status of sites by restricting the sample set to a given number of samples. Frequency restrictions bias the site sampling method to sample either uniformly, or in accordance with the allele frequency spectrum of the input VCF.
For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector.
The -sampleMode argument controls the mode of sample-based site consideration. The options are:
Note that Poly_based_on_gl uses the exact allele frequency calculation model to estimate P[site is nonref]. The site is considered for validation if P[site is nonref] > [this argument]. So if you want to validate sites that are >95% confidently nonref (based on the likelihoods), you would set -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95
The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are:
Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.
Take the master sites VCF and genotype each sample BAM file at these sites
(Optionally) Merge the single sample VCFs into a master VCF file
The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 20 9999996 . A ATC . PASS . GT:GQ 0/1:30 20 10000000 . T G . PASS . GT:GQ 0/1:30 20 10000117 . C T . FAIL . GT:GQ 0/1:30 20 10000211 . C T . PASS . GT:GQ 0/1:30 20 10001436 . A AGG . PASS . GT:GQ 1/1:30
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 9999996 . A ATC . PASS . GT:GQ 0/1:30 20 10000117 . C T . FAIL . GT:GQ 0/1:30 20 10000211 . C T . FAIL . GT:GQ 0/1:30 20 10000598 . T A . PASS . GT:GQ 1/1:30 20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30
In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:
20 9999996 . A ATC . PASS . GT:GQ 0/1:30 [pass in both] 20 10000000 . T G . PASS . GT:GQ 0/1:30 [only in batch 1] 20 10000117 . C T . FAIL . GT:GQ 0/1:30 [fail in both] 20 10000211 . C T . FAIL . GT:GQ 0/1:30 [pass in 1, fail in 2, choice in unclear] 20 10000598 . T A . PASS . GT:GQ 1/1:30 [only in batch 2] 20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30 [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]
These issues fall into the following categories:
There are two difficult situations that must be addressed by the needs of the project merging batches:
Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.
The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:
java -jar dist/GenomeAnalysisTK.jar \ -T CombineVariants \ -R human_g1k_v37.fasta \ -V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \ --sites_only \ -minimalVCF \ -o master.vcf
producing the following VCF:
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO 20 9999996 . A ACT . PASS set=Intersection 20 10000000 . T G . PASS set=one 20 10000117 . C T . FAIL set=FilteredInAll 20 10000211 . C T . PASS set=filterIntwo-one 20 10000598 . T A . PASS set=two 20 10001436 . A AGG,AGGCT . PASS set=Intersection
Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.
As a concrete example command line, you can genotype the master.vcf file using in the bundle sample NA12878 with the following command:
java -Xmx2g -jar dist/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R bundle/b37/human_g1k_v37.fasta \ -I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \ -alleles master.vcf \ -L master.vcf \ -gt_mode GENOTYPE_GIVEN_ALLELES \ -out_mode EMIT_ALL_SITES \ -stand_call_conf 0.0 \ -glm BOTH \ -G none \
-L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.
The last item,-G ` prevents the UG from computing annotations you don't need. This command produces something like the following output:
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 9999996 . A ACT 4576.19 . . GT:DP:GQ:PL 1/1:76:99:4576,229,0 20 10000000 . T G 0 . . GT:DP:GQ:PL 0/0:79:99:0,238,3093 20 10000211 . C T 857.79 . . GT:AD:DP:GQ:PL 0/1:28,27:55:99:888,0,870 20 10000598 . T A 1800.57 . . GT:AD:DP:GQ:PL 1/1:0,48:48:99:1834,144,0 20 10001436 . A AGG,AGGCT 1921.12 . . GT:DP:GQ:PL 0/2:49:84.06:1960,2065,0,2695,222,84
Several things should be noted here:
This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:
foreach sample in samples: run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf end
You can use a similar command for CombineVariants above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:
java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf
For a complete, detailed argument reference, refer to the GATK document page here
The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.
The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.
The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 chr20 332341 rs6076509 A G 470.60 PASS AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57 GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99 chr20 332503 rs6133033 C G 726.23 PASS AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75 GT:DP:GL:GQ:PQ 1|0:60:-110.83,-18.08,-149.73:99:126.93
The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. Note that for cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.
For example, consider the following records from the VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 SAMP2 chr1 1 . A G 99 PASS . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 2 . A G 99 PASS . GT:GL:GQ:PQ 1|1:-100,0,-100:99:60 0|1:-100,0,-100:99:50 chr1 3 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:60 0|0:-100,0,-100:99:60 chr1 4 . A G 99 FAIL . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 5 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:70 1|0:-100,0,-100:99:60 chr1 6 . A G 99 PASS . GT:GL:GQ:PQ 0/1:-100,0,-100:99 1|1:-100,0,-100:99:70 chr1 7 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:80 0|1:-100,0,-100:99:70 chr1 8 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:90 0|1:-100,0,-100:99:80
The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:
And two haplotypes at positions 6-8:
And, SAMP2 has the two haplotypes at positions 1-8:
Detailed information about command line options for BaseRecalibrator can be found here.
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.
This process is accomplished by analyzing the covariation among several features of a base. For example:
These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.
The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.
Detailed information about command line options for BaseRecalibrator can be found here.
This GATK processing step walks over all of the reads in
my_reads.bam and tabulates data about the following features of the bases:
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called
my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:
java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam
After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the
recalibration_report.grp file, into a new BAM file.
This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.
Effectively the new quality score is:
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
Note that the BasRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.
The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:
This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).
Example Arguments table:
#:GATKTable:true:1:17::; #:GATKTable:Arguments:Recalibration argument collection values used in this run Argument Value covariate null default_platform null deletions_context_size 6 force_platform null insertions_context_size 6 ...
The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.
The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.
Example Arguments table:
#:GATKTable:true:2:94:::; #:GATKTable:Quantized:Quality quantization map QualityScore Count QuantizedScore 0 252 0 1 15972 1 2 553525 2 3 2190142 9 4 5369681 9 9 83645762 9 ...
This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:; #:GATKTable:RecalTable0: ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors SRR032768 D 40.7476 45.0000 2642683174 222475 SRR032766 D 40.9072 45.0000 2630282426 213441 SRR032764 D 40.5931 45.0000 2919572148 254687 SRR032769 D 40.7448 45.0000 2850110574 240094 SRR032767 D 40.6820 45.0000 2820040026 241020 SRR032765 D 40.9034 45.0000 2441035052 198258 SRR032766 M 23.2573 23.7733 2630282426 12424434 SRR032768 M 23.0281 23.5366 2642683174 13159514 SRR032769 M 23.2608 23.6920 2850110574 13451898 SRR032764 M 23.2302 23.6039 2919572148 13877177 SRR032765 M 23.0271 23.5527 2441035052 12158144 SRR032767 M 23.1195 23.5852 2820040026 13750197 SRR032766 I 41.7198 45.0000 2630282426 177017 SRR032768 I 41.5682 45.0000 2642683174 184172 SRR032769 I 41.5828 45.0000 2850110574 197959 SRR032764 I 41.2958 45.0000 2919572148 216637 SRR032765 I 41.5546 45.0000 2441035052 170651 SRR032767 I 41.5192 45.0000 2820040026 198762
This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable1: ReadGroup QualityScore EventType EmpiricalQuality Observations Errors SRR032767 49 M 33.7794 9549 3 SRR032769 49 M 36.9975 5008 0 SRR032764 49 M 39.2490 8411 0 SRR032766 18 M 17.7397 16330200 274803 SRR032768 18 M 17.7922 17707920 294405 SRR032764 45 I 41.2958 2919572148 216637 SRR032765 6 M 6.0600 3401801 842765 SRR032769 45 I 41.5828 2850110574 197959 SRR032764 6 M 6.0751 4220451 1041946 SRR032767 45 I 41.5192 2820040026 198762 SRR032769 6 M 6.3481 5045533 1169748 SRR032768 16 M 15.7681 12427549 329283 SRR032766 16 M 15.8173 11799056 309110 SRR032764 16 M 15.9033 13017244 334343 SRR032769 16 M 15.8042 13817386 363078 ...
This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable2: ReadGroup QualityScore CovariateValue CovariateName EventType EmpiricalQuality Observations Errors SRR032767 16 TACGGA Context M 14.2139 817 30 SRR032766 16 AACGGA Context M 14.9938 1420 44 SRR032765 16 TACGGA Context M 15.5145 711 19 SRR032768 16 AACGGA Context M 15.0133 1585 49 SRR032764 16 TACGGA Context M 14.5393 710 24 SRR032766 16 GACGGA Context M 17.9746 1379 21 SRR032768 45 CACCTC Context I 40.7907 575849 47 SRR032764 45 TACCTC Context I 43.8286 507088 20 SRR032769 45 TACGGC Context D 38.7536 37525 4 SRR032768 45 GACCTC Context I 46.0724 445275 10 SRR032766 45 CACCTC Context I 41.0696 575664 44 SRR032769 45 TACCTC Context I 43.4821 490491 21 SRR032766 45 CACGGC Context D 45.1471 65424 1 SRR032768 45 GACGGC Context D 45.3980 34657 0 SRR032767 45 TACGGC Context D 42.7663 37814 1 SRR032767 16 AACGGA Context M 15.9371 1647 41 SRR032764 16 GACGGA Context M 18.2642 1273 18 SRR032769 16 CACGGA Context M 13.0801 1442 70 SRR032765 16 GACGGA Context M 15.9934 1271 31 ...
The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.
If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.
All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.
I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?
Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.
The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.
However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:
For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.
As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.
Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box.
Detailed information about command line options for VariantRecalibrator can be found here.
Detailed information about command line options for ApplyRecalibration can be found here.
The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.
The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
The variant recalibrator contrastively evaluates variants in a two step process, each performed by a distinct tool:
Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.
Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.
Please see the VQSR tutorial for step-by-step instructions on running these tools.
The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.
During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.
The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.
The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.
In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.
The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).
An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!
The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.
The first tranche (from the bottom, with lowest values) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.
This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.
Note that the tranches plot is not applicable for indels.
We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:
We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.
This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.
One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding
--maxGaussians 4 to your command line.
maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.
The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well.
There are two types of GATK tools that are able to use pedigree (family structure) information:
The two variant callers (HaplotypeCaller and the deprecated UnifiedGenotyper) as well as VariantAnnotator and GenotypeGVCFs are all able to use pedigree information if you request an annotation that involves population structure (e.g. Inbreeding Coefficient). To be clear though, the pedigree information is not used during the variant calling process; it is only used during the annotation step at the end.
If you already have VCF files that were called without pedigree information, and you want to add pedigree-related annotations (e.g to use Variant Quality Score Recalibration (VQSR) with the InbreedingCoefficient as a feature annotation), don't panic. Just run the latest version of the VariantAnnotator to re-annotate your variants, requesting any missing annotations, and make sure you pass your PED file to the VariantAnnotator as well. If you forget to provide the pedigree file, the tool will run successfully but pedigree-related annotations may not be generated (this behavior is different in some older versions).
The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.
For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.