Tagged with #vqsr
2 documentation articles | 0 events or announcements | 19 forum discussions


Slides that explain the VQSR methodology as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.

Introduction

The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. One can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model. The variant recalibrator contrastively evaluates variants in a two step process:

  • VariantRecalibration - Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants.
  • ApplyRecalibration - Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the lod threshold as provided by the user (with the ts_filter_level parameter).

Recalibration tutorial with example HiSeq, single sample, deep coverage, whole genome call set

By way of explaining how one uses the variant quality score recalibrator and evaluating its performance we have put together this tutorial which uses example sequencing data produced at the Broad Institute. All of the data used in this tutorial is available in VCF format from our GATK resource bundle.

Input call set

input: NA12878.HiSeq.WGS.bwa.cleaned.raw.b37.subset.vcf

  • These calls were generated with the UnifiedGenotyper from a 30X coverage modern, single sample run of HiSeq. They were randomly downsampled to keep the file size small but in general one would want to use the full set of variants available genome-wide for this procedure. No other pre-filtering steps were applied to the raw output.

Training sets

HapMap 3.3: hapmap_3.3.b37.sites.vcf

  • These high quality sites are used both to train the Gaussian mixture model and then again when choosing a LOD threshold based on sensitivity to truth sites.
  • The parameters for these sites will be: known = false, training = true, truth = true, prior = Q15 (96.84%)

Omni 2.5M chip: 1000G_omni2.5.b37.sites.vcf

  • These polymorphic sites from the Omni genotyping array are used when training the model.
  • The parameters for these sites will be: known = false, training = true, truth = false, prior = Q12 (93.69%)

dbSNP build 132: dbsnp_132.b37.vcf

  • The dbsnp sites are generally considered to be not of high enough quality to be used in training but here we stratify output metrics such as ti/tv ratio by presence in dbsnp (known sites) or not (novel sites).
  • The parameters for these sites will be: known = true, training = false, truth = false, prior = Q8 (84.15%)

The default prior for all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the UnifiedGenotyper is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

VariantRecalibrator

Detailed information about command line options for VariantRecalibrator can be found here.

Build a Gaussian mixture model using a high quality subset of the input variants and evaluate those model parameters over the full call set. The following notes describe the appropriate inputs to use for this tool.

  • Note that this walker expects call sets in which each record has been appropriately annotated (see e.g. VariantAnnotator). Input call set rod bindings must start with "input". See the command line below.
  • When constructing an initial call set (see e.g. Unified Genotyper or Haplotype Caller) for use with the Recalibrator, it's generally best to turn down the confidence threshold to allow more borderline calls (trusting the Recalibrator to keep the real ones while filtering out the false positives). For example, we often use a Q20 threshold on our deep coverage calls with the Recalibrator (whereas the default threshold in the UnifiedGenotyper is Q30).
  • No pre-filtering is necessary when using the Recalibrator. See below for the advanced options which allow the user to selectively ignore certain filters if they have already been applied to your call set.
  • The tool accepts any ROD bindings when specifying the set of truth sites to be used during modeling. Information about how to download VCF files which we routinely use for training is in the FAQ section at the bottom of the page.
  • Each training set ROD binding is specified with key-value tags to qualify whether the set should be considered as known sites, training sites, and/or truth sites. Additionally, the prior probability of being true for those sites is also specified via these tags in Phred scale. See the command line below for an example. An explanation for how each of the training sets is used by the algorithm:
  • Training sites: Input variants which are found to overlap with these training sites are used to build the Gaussian mixture model.
  • Truth sites: When deciding where to set the cutoff in VQSLOD sensitivity to these truth sites is used. Typically one might want to say I dropped my threshold until I got back 99% of HapMap sites, for example.
  • Known sites: The known / novel status of a variant isn't used by the algorithm itself and is only used for reporting / display purposes. The output metrics are stratified by known status in order to aid in comparisons with other call sets.

Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The first tranche is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibator walker, is shown on the right.

Tranches plot for example HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric [YES!]
  • The truth sensitivity (TS) approach is conceptual cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

ApplyRecalibration

Detailed information about command line options for ApplyRecalibration can be found here.

Using the tranche file generated by the previous step the ApplyRecalibration walker looks at each variant's VQSLOD value and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level have their filter field annotated with its tranche level. This will result in a call set that simultaneously is filtered to the desired level but also has the information necessary to pull out more variants at a slightly lower quality level.

Frequently Asked Questions

How do I know which annotations to use for my data?

The five annotation values provided in the command lines above (QD, HaplotypeScore, MQRankSum, ReadPosRankSum, and HRun) have been show to give good results for a variety of data types. However this shouldn't be taken to mean these annotations give the absolute best modeling for every source of sequencing data. Better results could possibly be achieved through experimentation with which SNP annotations are used in the algorithm. The goal is to find annotation values with are approximately Gaussianly distributed and also serve to separate the probably true (known) SNPs from the probably false (novel) SNPs.

How do I know which -tranche arguments to pass into the VariantRecalibrator step?

The -tranche arguments main purpose is to create the tranche plot (as shown above). They are meant to convey the idea that with real, calibrated variant quality scores one can create call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way an end user can choose to use some of the filtered records or only use the PASSing records. For new users to the variant quality score recalibrator perhaps the easiest thing to do in the beginning is simply select the single desired false discovery rate and pass that value in as a single -tranche argument to make sure that the desired rate can be achieved given the other parameters to the algorithm.

What should I use as training data?

The VariantRecalibrator step accept lists of truth and training sites in several formats (dbsnp ROD, VCF, and BED, for example). Any list can be used but it is best to use only those sets which are of the best quality. The truth sets are passed into the algorithm using any rod binding name and their truth or training status is specified with rod tags (see VariantRecalibrator section above). We routinely use the HapMap v3.3 VCF file and the Omni2.5M SNP chip array in training the model. In general the false positive rate of dbsnp sites is too high to be used reliably for training the model.

HapMap v3.3 as well as the Omni validation array VCF files are available in our GATK resource bundle.

Does the VQSR work with non-human variant calls?

Absolutely! The VQSR accepts any list of sites to use as training / truth data, not just HapMap.

Don't have any truth data for your organism? No problem. There are several things one might experiment with. One idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP caller, of which the GATK is one, and use those sites which are concordant between the different methods as truth data. There are many fruitful avenues of research here. Hopefully the model reporting plots help facilitate this experimentation. Perhaps the best place to begin is to use a line like the following when specifying the truth set:

-resource:concordantSet,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf

Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties. One piece of advice is to turn down the number of Gaussians used during training and to turn up the number of variants that are used to train the negative model. This can be accomplished by adding --maxGaussians 4 --percentBad 0.05 to your command line.

percentBadVariants is the proportion of variants that the program will use to build the negative model. If you have a small callset, you need to use a larger proportion in order to have enough "bad" variants to build the negative model. maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well.

VariantRecalibrator

For use with calls generated by the UnifiedGenotyper and HaplotypeCaller

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Common, base command line

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:

   -percentBad 0.01 -minNumBad 1000 \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP \
   -mode SNP \

Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

Important notes about annotations

Some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.

Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

Additionally, the UnifiedGenotyper produces a statistic called the HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes.

Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline)
  • Use the VQSR with the smaller variant callset but experiment with the precise argument settings (try adding --maxGaussians 4 --percentBad 0.05 to your command line, for example)

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle. Arguments for VariantRecalibrator:

   --maxGaussians 4 -percentBad 0.01 -minNumBad 1000 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an DP -an FS -an ReadPosRankSum -an MQRankSum \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.

ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

For use with calls generated by the UnifiedGenotyper and HaplotypeCaller

Common, base command line

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.

   --ts_filter_level 99.9 \
   -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.

   --ts_filter_level 99.9 \
   -mode INDEL \
Sorry, there are no publicly available documents of this type with the tag #vqsr. Try one of the other types.

Hi,

I just run HyplotypeCaller on a dataset. For the same dataset, I have run through Unified genotyper and then directly subjected the raw vcf from UG to VQSR step without the help of VariantAnnotator before and get through VQSR without any problem. However, when I try to subject the raw callset derived from HyplotypeCaller directly to VQSR step, the VQSR module complained about it and error message is below:

...

ERROR MESSAGE: Bad input: Values for HaplotypeScore annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See

http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator

So after HyplotypeCaller, the derived vcf file needs to run though VariantAnnotator? Since Unified genotyper derived callset does not need the help of VariantAnnotator (all annotations needed for VQSR are included after UG), it seems not the case for HyplotypeCaller? I can run through VariantAnnotator for HyplotypeCaller derived vcf file, just want to make sure if my understanding is correct?

Thanks and best

Mike

Hello,

Does VQSR behave differently when the -out_mode flag in UnifiedGenotyper is set to EMIT_VARIANTS_ONLY as compared to EMIT_ALL_CONFIDENT_SITES. I think by using EMIT_ALL_CONFIDENT_SITES we might give VQSR more information to train the model, but I may be wrong. Can someone please help me with this ? Thanks.

cheers, Rahul

I was wondering if anyone has used VQSR for a mouse related genome project. I am working with mm10 dbsnp and DNA-seq short insert data for multiple homozygous mouse samples. I have obtained decent results so far using the mm10 dbsnp as the training set, but was curious to see if anyone had any recommendations as to what settings to use. Any input is appreciated. I also have a lot of RNA-seq data, but that will come at a much later point in time. Thanks!

Hi all, I'm running VariantRecalibrator on a SNP set (47 exomes) and I get this error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.2-3-gde33222): 
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
##### ERROR ------------------------------------------------------------------------------------------

this is the command line:

    java -Djava.io.tmpdir=/lustre2/scratch/  -Xmx32g -jar /lustre1/tools/bin/GenomeAnalysisTK-2.2-3.jar \
    -T VariantRecalibrator \
    -R /lustre1/genomes/hg19/fa/hg19.fa \
    -input /lustre1/workspace/Ferrari/Carrera/Analysis/UG/bpd_ug.SNP.vcf \
    -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 /lustre1/genomes/hg19/annotation/hapmap_3.3.hg19.sites.vcf.gz \
    -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 /lustre1/genomes/hg19/annotation/1000G_omni2.5.hg19.sites.vcf.gz \
    -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 /lustre1/genomes/hg19/annotation/dbSNP-137.chr.vcf -an QD \
    -an HaplotypeScore \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an FS \
    -an MQ \
    -an DP \
    -an QD \
    -an InbreedingCoeff \
    -mode SNP \
    -recalFile /lustre2/scratch/Carrera/Analysis2/snp.ug.recal.csv \
    -tranchesFile /lustre2/scratch/Carrera/Analysis2/snp.ug.tranches \
    -rscriptFile /lustre2/scratch/Carrera/Analysis2/snp.ug.plot.R \
    -U ALLOW_SEQ_DICT_INCOMPATIBILITY \
    --maxGaussians 6

I've already tried to decrease the --maxGaussians option to 4, I've also added --percentBad option (setting it up to 0.12, as for INDEL) but I still get the error. I've added the option -debug to see what's happening, but apparently this has been removed from GATK-2.2. Any help is appreciated... thanks

Hello, I have a new sequenced genome with some samples for this specie, I would like to follow the best practices but I don't have a dbsnp or something similar, but could I use the variants from the samples as a dbsnp? for example get the variants that coincide in all my samples and use it as a dbsnp?

Thanks!

Hi all, I've somewhere in this site that before VQSR the FP rate is expected to be around 10% (I guess for UnifiedGenotyper). Are there some updated statistics for VQRS? For HaplotypeCaller? For Exome/WG data? Another thing: we apply VQRS on all our analysis, we are trying to collect some validation statistics. We suspect that most of the FP have some particular "culprits" in VQRS (especially QD and MQ). Do you have some data about this? Best

d

Hi,

I observed a significant difference of the variant call sets from the same exomes between v1.6 and v2.2(-10). In fact, I observed a significant decrease in the overall novel TiTv in the latter call sets from around 2.6 to 2.1 at TruthSensitivity threshold at 99.0. When I looked at a sample to compare variant sites using VariantEval, it showed that

Filter JexlExpression Novelty nTi nTv tiTvRatio
called Intersection known 14624 4563 3.2
called Intersection novel 856 312 2.74
called filterIngatk22-gatk16 known 264 132 2
called filterIngatk22-gatk16 novel 28 18 1.56
called gatk16 known 3 1 3
called gatk16 novel 1 1 1
called gatk22-filterIngatk16 known 258 94 2.74
called gatk22-filterIngatk16 novel 144 425 0.34
called gatk22 known 2 2 1
called gatk22 novel 17 30 0.57
filtered FilteredInAll known 1344 649 2.07
filtered FilteredInAll novel 1076 1642 0.66

The novel TiTv of new calls in v2.2 not found in v1.6 or called in v2.2 but filtered in v1.6 demonstrated novel TiTv around 0.5. So I suspect that VQSLOD scoring (or ranking) of SNPs was changed substantially in somewhat an unfavorable way.

The major updates in v2.2 affecting my result were BQSRv2, ReduceReads, UG and VariantAnnotation. (Too many things to pin-point the culprit...) The previous BAM processing and variant calls were made using v1.6. For the new call set, I used v2.1-9 (so after serious bug fix in ReduceReads, thank you for the fix) for BQSRv2 and ReduceReads and v2.2-10 for UG and VQSR.

As a first clue, I found that distribution of FS values changed dramatically from the v1.6 (please see attached plots). Although I recognized that FS value calculations were recently updated, the distribution of previous FS values (please see attached) makes more sense for me because the current FS values do not seem to provide us information to classify true positives and false positives.

Thanks in advance. Katsuhito

I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogenous data types - I'd like to know if anyone has found it useful to deviate from their recommendations.

For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.

In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.

Hi, I am new to GATK, I have been trying to figure a strange error that I haven't been able to resolve for days.

Process so far. 1. Run UnifiedGenotyper per chr using -L option on ~ 130 samples 2. Merge all output vcf files into one. (using tabix to gz and index each vcf file, then use vcf-concat to merge all chr* files) 3. Use a perl script to sort merged vcf file based on the reference file order. i.e (chr1, 2, 3...M) 4. Split Merged.sorted.vcf file into INDEL and SNV files. 5. Run VQSR on each file (SNV and INDEL).

Error that I get: During ApplyRecalibration for INDELs I get an error in chr9 that states that a coordinate A is after Coordinate B (A < B, and A and B are different values, each time). This always happens in chr9. I checked my input Merged.sorted.indel.vcf file around coordinate A and B and its file is in order. I checked the recal file and it is also in order. So I can't figure out where the error is coming from. The strange thing is that error is reported when GATK is creating the output file, not during its computation/applying recalibration.

Has anyone encountered such a situation before?  Or  have any ideas I should try to resolve the error.  I don't get any errors with SNVs only INDEL's

Exact error message:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to merge temporary Tribble output file. at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:259) at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:103) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:248) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92) Caused by: org.broad.tribble.TribbleException$MalformedFeatureFile: We saw a record with a start of chr9:33020249 after a record with a start of chr9:34987121, for input source: /data2/bsi/secondary/multisample/Merged.variant.filter.INDEL_2.vcf at org.broad.tribble.index.DynamicIndexCreator.addFeature(DynamicIndexCreator.java:164) at org.broadinstitute.sting.utils.codecs.vcf.IndexingVCFWriter.add(IndexingVCFWriter.java:118) at org.broadinstitute.sting.utils.codecs.vcf.StandardVCFWriter.add(StandardVCFWriter.java:163) at org.broadinstitute.sting.gatk.io.storage.VCFWriterStorage.mergeInto(VCFWriterStorage.java:120) at org.broadinstitute.sting.gatk.io.storage.VCFWriterStorage.mergeInto(VCFWriterStorage.java:26) at org.broadinstitute.sting.gatk.executive.OutputMergeTask.merge(OutputMergeTask.java:48) at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:253) ... 6 more

ERROR ------------------------------------------------------------------------------------------

Exact command:

/usr/java/latest/bin/java -Xmx6g -XX:-UseGCOverheadLimit -Xms512m -jar /projects/apps/alignment/GenomeAnalysisTK/latest/GenomeAnalysisTK.jar -R /data2/reference/sequence/human/ncbi/37.1/allchr.fa -et NO_ET -K /projects/apps/alignment/GenomeAnalysisTK/latest/Hossain.Asif_mayo.edu.key -mode INDEL -T ApplyRecalibration -nt 4 -input /data2/secondary/multisample/Merged.variant.INDEL.vcf.temp -recalFile /data2/secondary/multisample/temp/Merged.variant.INDEL.recal -tranchesFile /data2/secondary/multisample/temp/Merged.variant.INDEL.tranches -o /data2/secondary/multisample/Merged.variant.filter.INDEL_2.vcf

Version of GATK : 1.7 and 1.6.7

When performing VQSR, the data set has its variants overlapped with the training set, may I know if all the overlapped variants are used in the training or is it down sampled?

I am seeing this error on single human WGS sample -

The provided VCF file is malformed at approximately line number "x": there are 557 genotypes while the header requires that 1525 genotypes be present for all records

Interestingly, when I run VQSR as part of the same pipeline on the same sample consecutive times, the "x" changes to different line numbers each time. I was wondering if someone could explain the meaning of the error message more?

Hello,

I am running Variant Quality Score Recalibration on indels with the following command.

java -Xmx8g -jar /raid/software/src/GenomeAnalysisTK-1.6-9-g47df7bb/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /raid/references-and-indexes/hg19/bwa/hg19_lite.fa \
    -input indel_output_all_chroms_combined.vcf \
    --maxGaussians 4 -std 10.0 -percentBad 0.12 \
    -resource:mills,known=true,training=true,truth=true,prior=12.0  /raid/Merlot/exome_pipeline_v1/ref/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -an QD -an FS -an HaplotypeScore -an ReadPosRankSum  \
    --ts_filter_level 95.0 \
     -mode INDEL \
    -recalFile /raid2/projects/STFD/indel_output_7.recal \
    -tranchesFile /raid2/projects/STFD/indel_output_7.tranches \
    -rscriptFile /raid2/projects/STFD/indel_output_7.plots.R

My tranches file reports only false positives for all tranches. When I run VQSR on SNPS, the tranches have many true positives and look similar to other tranch files reported on this site. I am wondering if anyone has similar experiences or suggestions?

Thanks

Hi,

I've been going through the VQSR documentation/guide and haven't been able to pin down an answer to how it behaves on multi-sample VCF (generated by multi-sample calling with UG). Should VQSR be run on this? Or on each sample separately, given that coverage and other statistics used to determine the variant confidence score aren't the same for each sample and so can lead to conflicting determinations on different samples.

What is the best way to go about this?

Many thanks.

Hi Mark, Eric -

First, I wanted to thank you guys for providing advice with respect to running VQSR. I am already sold and a huge fan of the method :-).

I was wondering if either of you could comment on VQSLOD and sensitivity filter tranche? To be more specific, if I set a filter threshold of 99% for sensitivity and VQSLOD < 0 I imagine that probably is not a good idea! However, a VQSLOD of 3 or 5 may be appropriate in the statistical sense, i.e. pretty confident that this is a real variant. Finally, I am thinking we should include VQSLOD in our statistical genetic association mapping methods. I wanted to get a sense from either of you what VQSLOD you would want to completely remove from analysis?

Best Wishes,

Manny.

Hello,

I am trying to run GATK on a sample of 119 exomes. I followed the GATK guidelines to process the fastq files. I used the following parameters to call the UnifiedGenotyper and VQSR [for SNPs]:

UnifiedGenotyper

-T UnifiedGenotyper 
--output_mode EMIT_VARIANTS_ONLY 
--min_base_quality_score 30 
--max_alternate_alleles 5 
-glm SNP 

VQSR

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /media/transcription/cipn/5.pt/ref/hapmap_3.3.hg19.sites.vcf 
-resource:omni,known=false,training=true,truth=false,prior=12.0 /media/transcription/cipn/5.pt/ref/1000G_omni2.5.hg19.sites.vcf 
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /media/transcription/cipn/5.pt/ref/dbsnp_135.hg19.vcf.gz 
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff 
-mode SNP 

I get a tranche plot, which does not look OK. The "Number of Novel Variants [1000s]" goes from -400 to 800 and the Ti/Tv ratio varies from 0.633 to 0.782 [the attach file link is not working for me and am unable to upload the plot]. Any suggestion to rectify this would be very helpful !

cheers, Rahul

We have data from target sequencing genes (only targeted two genes). We analyzed the data by GATK pipeline. Since the data set is too small, we tried hard filtration on both SNP and indels. At the same time, we sequenced the same sample by whole exome sequencing and filter SNP by VQSR. The quality of VQSR results is much better than hard filtration results. For economic reason, we need to develop analysis pipeline for target sequencing, is it ok to incorporate the target sequencing data into an exome sequencing data (merge the VCF files), do VQSR? I just worried the true sites in target sequencing data have different features compared to true sites in whole exome sequencing data.

Hi

We have 100 samples run through the GATK unified genotyper and then we merged all the VCF files to run the multi samples VQSR. (merged was done using VCFTOOLS). What attributes we should use in this case.

For multi sample called vcf we use these paramters:

-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP -nt 2 --maxGaussians 4 --percentBadVariants 0.05

any help is deeply appreciated.

Thanks

Saurabh

Hi,

Recently I run into some odd observation in VQSR. I have 17 samples from a same family and I used all of 17 samples to call SNPs and after VQSR, I got the trench file like this:

Variant quality score tranches file

Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000 99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900 99.90,67220,2884,2.7190,1.8222,-10.0009,VQSRTrancheSNP99.00to99.90,SNP,26182,26155,0.9990 100.00,69714,4998,2.6822,1.8300,-1122.0698,VQSRTrancheSNP99.90to100.00,SNP,26182,26182,1.0000

which seems fine. then for research purpose, I only used 5 samples of more tight relation such as two parents and their 3 immediate children and after VQSR, the trench file looks like below:

Variant quality score tranches file

Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP0.00to90.00,SNP,20850,20850,1.0000 99.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP90.00to99.00,SNP,20850,20850,1.0000 99.90,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.00to99.90,SNP,20850,20850,1.0000 100.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000

Notice that the 5-sample VQSR tranch file has exactly the same thing throughout all thresholds: 90, 99, 99.90 and 100. and the VQSR modeling plot is also very odd, no plotting at all being seen (the pdf ifle was created but was almost blank in contrast to the normal projection plots I saw in other cases)

However, we did use the old version to call the same 5 samples before, and the trench file looks like below:

Variant quality score tranches file

Version number 4

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,36407,361,2.8657,2.3119,5.0854,TruthSensitivityTranche0.00to90.00,20814,18732,0.9000 99.00,44097,638,2.7655,2.2222,2.2592,TruthSensitivityTranche90.00to99.00,20814,20605,0.9900 99.90,47947,1061,2.7078,1.8750,-7.4143,TruthSensitivityTranche99.00to99.90,20814,20793,0.9990 100.00,50426,2318,2.6645,1.7677,-647.3944,TruthSensitivityTranche99.90to100.00,20814,20814,1.0000

this time, it looks reasonable to me. This is troubling us since for 5 samples, the old version (V1.6-7) seems working fine, whereas the new version (V2.1-13) seems having issue or can not get further filtering by VQSR (90, 99 and 100 got the same result, I did repeat multiple times and got the same results), although for all of the 17 samples, the new version seems fine on VQSR.

So my questions are: 1. is it possible that in some occasion, VQSR can simply not work? 2. Why the old version seems working but not the new version for exactly the same set of 5-sample data?

Thanks a lot for your help!

Mike

Hi,

I'm having a little trouble understanding the relationship between the -ts_filter_level and -tranche settings for VQSR. If I'm not mistaken the defaults are 99 and [100,99.9,99.0,90] respectively. When I run VQSR with these defaults, my tranches are altered because of the 99 ts filter level. I get:

##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=TruthSensitivityTranche99.00to99.90,Description="Truth sensitivity tranche level at VSQ Lod: -0.1838 <= x < 3.1102">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00+,Description="Truth sensitivity tranche level at VQS Lod < -6135.0237">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00,Description="Truth sensitivity tranche level at VSQ Lod: -6135.0237 <= x < -0.1838">

Is it odd that there are two tranches with the same ts values and different VQSLOD values? If I adjust the ts filter level to 90, I get what I originally expected to see:

##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=TruthSensitivityTranche90.00to99.00,Description="Truth sensitivity tranche level at VSQ Lod: 2.5901 <= x < 4.8133">
##FILTER=<ID=TruthSensitivityTranche99.00to99.90,Description="Truth sensitivity tranche level at VSQ Lod: -0.692 <= x < 2.5901">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00+,Description="Truth sensitivity tranche level at VQS Lod < -6.11002079587E7">

Is it just me, or does this seem to be an incompatibility between the defaults values? Which is more important, correct ts filtering or correct tranche intervals? We will at times filter based on these tranches, so I'd like to be setting them correctly. Thanks.

Ben