Tagged with #variantrecalibrator
4 documentation articles | 4 announcements | 68 forum discussions



Created 2013-06-17 22:26:13 | Updated 2014-12-17 17:04:18 | Tags: variantrecalibrator vqsr applyrecalibration
Comments (41)

Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole-genome analysis designs.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  2. Build the SNP recalibration model

  3. Apply the desired level of recalibration to the SNPs in the call set

  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  5. Build the Indel recalibration model

  6. Apply the desired level of recalibration to the Indels in the call set


1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
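
For reference, the Phred-scaled prior Q used in these tags converts to a probability of being a true site as 1 - 10^(-Q/10); the percentages quoted above follow directly from this standard conversion:

    Q15  ->  1 - 10^(-1.5)  ≈  0.9684  (96.84%)
    Q12  ->  1 - 10^(-1.2)  ≈  0.9369  (93.69%)
    Q10  ->  1 - 10^(-1.0)  =  0.90    (90%)
    Q2   ->  1 - 10^(-0.2)  ≈  0.3690  (36.90%)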

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • RMSMappingQuality (MQ)

Estimation of the overall mapping quality of reads supporting a variant call.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.
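
As noted above, if any of these annotations are missing from your callset they can be added with VariantAnnotator before recalibration. The following command is only a sketch: the file names are placeholders, the BAM is needed for the read-based annotations, and you should request just the annotations your data actually requires.

java -jar GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R reference.fa \
    -I recalibrated_reads.bam \
    -V raw_variants.vcf \
    -A Coverage \
    -A QualByDepth \
    -A FisherStrand \
    -A StrandOddsRatio \
    -A RMSMappingQuality \
    -A MappingQualityRankSumTest \
    -A ReadPosRankSumTest \
    -A InbreedingCoeff \
    -o raw_variants.annotated.vcf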

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
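
Put another way, each tranche threshold corresponds to a VQSLOD cutoff, and a higher target sensitivity means a lower (more permissive) cutoff. Schematically:

    90.0% tranche   ->  highest VQSLOD cutoff (most specific, fewest calls retained)
    99.0% tranche   ->  lower cutoff
    99.9% tranche   ->  lower still
    100.0% tranche  ->  lowest cutoff (most sensitive, essentially all calls retained)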


2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \ 
    -an QD \ 
    -an FS \ 
    -an SOR \ 
    -an MQ \
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -rscriptFile recalibrate_SNP_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.


3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf 

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and marked in the FILTER field as either PASS or the name of the tranche they fall in, depending on whether or not they are included in the selected tranche.
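
For illustration only, here is a sketch of what a few records might look like after this step (the positions and values are invented; the filter names follow the VQSRTrancheSNP<lower>to<upper> pattern, and the recalibrated scores appear in the INFO field as the VQSLOD and culprit annotations):

#CHROM  POS      ID  REF  ALT  QUAL   FILTER                       INFO
1       873762   .   T    G    [...]  PASS                         [...];VQSLOD=7.81;culprit=MQ
1       984302   .   A    C    [...]  VQSRTrancheSNP99.00to99.90   [...];VQSLOD=-0.37;culprit=QD
1       1021415  .   G    A    [...]  VQSRTrancheSNP99.90to100.00  [...];VQSLOD=-12.40;culprit=DP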

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
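
For example, to apply the more sensitive 99.9% tranche recommended in the Best Practices instead, only the truth sensitivity filter level changes (file names are the same as in the command above):

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.9 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_raw_indels.vcf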


4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include very many variants.


5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \ 
    -an QD \
    -an DP \ 
    -an FS \ 
    -an SOR \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -mode INDEL \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -o recalibrated_variants.vcf 

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and marked in the FILTER field as either PASS or the name of the tranche they fall in, depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.


Created 2012-08-02 14:05:29 | Updated 2014-12-17 17:05:58 | Tags: variantrecalibrator bundle vqsr applyrecalibration faq
Comments (25)

This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

  • Explanation of resource datasets
  • Important notes about annotations
  • Important notes about exome experiments
  • Argument recommendations for VariantRecalibrator
  • Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).


Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.
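
As a sketch of the first idea (bootstrapping a truth/training resource from your own highest-confidence calls), you could pull out stringently filtered sites with SelectVariants and feed them back in as a resource. The thresholds and file names below are invented and should be tuned empirically for your organism and dataset:

java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fa \
   -V initial_round_snps.vcf \
   -select "QD > 20.0 && FS < 10.0 && MQ > 50.0" \
   -o bootstrap_truth_snps.vcf

The resulting file would then be declared to VariantRecalibrator along the lines of -resource:bootstrap,known=false,training=true,truth=true,prior=10.0 bootstrap_truth_snps.vcf, with the Q10 prior reflecting the starting confidence suggested above.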

Resources for SNPs

  • True sites training resource: HapMap

    This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

    This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G
    This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Resources for Indels

  • Known and true sites training resource: Mills
    This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

Important notes about annotations

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

  • Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

  • You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by UnifiedGenotyper that should only be used if you called your variants with UG. This statistic isn't produced by the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes with HC.

  • The InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that include many closely related samples (such as a family), please omit this annotation from the command line.


Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs (a minimal sketch of this GVCF route follows this list).

  • You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.
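
For the first option above, here is a minimal sketch of the GVCF route; the file names are placeholders, for exomes you would normally also restrict to your capture intervals with -L, and depending on your GATK version additional GVCF indexing arguments may be required (see the HaplotypeCaller documentation):

java -jar GenomeAnalysisTK.jar \
   -T HaplotypeCaller \
   -R human_g1k_v37.fasta \
   -I 1000G_exome_sample.bam \
   --emitRefConfidence GVCF \
   -o 1000G_exome_sample.g.vcf

java -jar GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R human_g1k_v37.fasta \
   -V my_sample1.g.vcf \
   -V my_sample2.g.vcf \
   -V 1000G_exome_sample.g.vcf \
   -o joint_genotyped_for_vqsr.vcf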


Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
   -mode SNP \

Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).
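
For example, an exome recalibration might use an annotation block along these lines (a sketch only; drop InbreedingCoeff as well if you have fewer than 10 samples or many closely related samples, per the annotation notes above):

   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
   -mode SNP \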

Note also that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

   --maxGaussians 4 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.


Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.5 \
   -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.0 \
   -mode INDEL \

Created 2012-07-23 23:57:13 | Updated 2012-07-23 23:57:13 | Tags: variantrecalibrator gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at VariantRecalibrator.


Created 2012-07-23 16:49:34 | Updated 2015-06-03 14:42:06 | Tags: variantrecalibrator vqsr applyrecalibration vcf callset variantrecalibration
Comments (35)

This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.


Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
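
Schematically (a conceptual sketch rather than the exact implementation), the score reported for each variant can be read as:

    VQSLOD = log( P(annotations | model of true variants) / P(annotations | model of artifacts) )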

The variant recalibrator contrastively evaluates variants in a two step process, each performed by a distinct tool:

  • VariantRecalibrator
    Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.

  • ApplyRecalibration
    Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.

How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program looks up the VQSLOD value above which 99.9% of the variants in the training callset are included, and uses that VQSLOD value as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.


Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!


Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (from the bottom, with the lowest value of truth sensitivity but usually the highest value of novel Ti/Tv) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.


Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
  • The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.


Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

- Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command-line version of R that gets installed alongside the main R application. We also make use of the ggplot2 library, so please be sure to install that package as well. See the Common Problems section of the Guide for more details.


Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs gatk-3-2
Comments (13)

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.


We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.


Haplotype Caller

  • Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
  • Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
  • The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
  • Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
  • Fixed small bug where indel discovery was inconsistent based on the active-region size.
  • Removed scary warning messages for "VectorPairHMM".
  • Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
  • When we subset PLs because alleles are removed during genotyping we now also subset the AD.
  • Fixed bug where reference sample depth was dropped in the DP annotation.

Variant Recalibrator

  • The -mode argument is now required.
  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

AnalyzeCovariates

  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

Variant Annotator

  • SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

  • Added missing arguments so that it now more closely models what's available in the Haplotype Caller.
  • Fixed recurring error about missing PLs.
  • No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
  • --includeNonVariantSites should now be working.

Select Variants

  • The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

  • Now throws a user error when it encounters reads with I operators greater than the number of read bases.
  • Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

CalculateGenotypePosteriors

  • Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
  • Now computes genotype posteriors using likelihoods from all members of the trio.
  • Added annotations for calling potential de novo mutations.
  • Now uses PP tag instead of GP tag because posteriors are Phred-scaled.

Cat Variants

  • Can now process .list files with -V.
  • Can now handle BCF and Block-Compressed VCF files.

Validate Variants

  • Now works with gVCF files.
  • By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

FastaAlternateReferenceMaker

  • Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

Miscellaneous

  • Refactored maven directories and java packages replacing "sting" with "gatk".
  • Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
  • Added a new read transformer that refactors NDN cigar elements to one N element.
  • Now a Tabix index is created for block-compressed output formats.
  • Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
  • Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
  • We now check that output files are specified in a writeable location.
  • We now allow blank lines in a (non-BAM) list file.
  • Added legibility improvements to the Progress Meter.
  • Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
  • Made IntervalSharder respect the IntervalMergingRule specified on the command line.
  • Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.

Created 2014-03-17 16:52:43 | Updated 2014-03-19 15:13:51 | Tags: variantrecalibrator haplotypecaller randomlysplitvariants release-notes fastaalternatereferencemaker
Comments (0)

GATK 3.1 was released on March 18, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history


Haplotype Caller

  • Added new capabilities to the Haplotype Caller to use hardware-based optimizations. Can be enabled with --pair_hmm_implementation VECTOR_LOGLESS_CACHING. Please see the 3.1 Version Highlights for more details about expected speed ups and some background on the collaboration that made these possible.
  • Fixed bugs in computing the weights of edges in the assembly graph. This was causing bad genotypes to be output when running the Haplotype Caller over multiple samples simultaneously (as opposed to creating gVCFs in the new recommended pipeline, which was working as expected).

Variant Recalibrator

  • Fixed issue where output could be non-deterministic with very large data sets.

CalculateGenotypePosteriors

  • Fixed several bugs where bad input were causing the tool to crash instead of gracefully exiting with an error message.

Miscellaneous

  • RandomlySplitVariants can now output splits comprised of more than 2 output files.
  • FastaAlternateReferenceMaker can now output heterozygous sites using IUPAC ambiguity encoding.
  • Picard, Tribble, and Variant jars updated to version 1.109.1722.
Comments (2)

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.


Unified Genotyper

  • Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

  • Improved the indexing scheme for gVCF outputs using the reference calculation model.
  • The reference calculation model now works with reduced reads.
  • Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
  • Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

  • Disabled tranche plots in INDEL mode.
  • Various VQSR optimizations in both runtime and accuracy. Particular details include: for very large whole-genome datasets with over 2M variants overlapping the training data, the training set used to build the model is randomly downsampled; annotations are ordered by the difference in means between known and novel variants instead of by their standard deviation; the training set quality score threshold has been removed; the negative model now uses 2 Gaussians by default; the numBad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.

Reduce Reads

  • Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

  • Added calculation for GC content.
  • Added an option to filter the bases based on their quality scores.

Combine Variants

  • Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

  • Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

Miscellaneous

  • SplitSamFile now produces an index with the BAM.
  • Length metric updates to QualifyMissingIntervals.
  • Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
  • Picard jar updated to version 1.104.1628.
  • Tribble jar updated to version 1.104.1628.
  • Variant jar updated to version 1.104.1628.

Created 2013-08-21 21:15:21 | Updated 2014-02-08 20:09:15 | Tags: indelrealigner unifiedgenotyper variantrecalibrator official haplotypecaller reducereads release-notes
Comments (2)

GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history


Reduce Reads

  • Changed the underlying convention of having unstranded reduced reads; instead there are now at least 2 compressed reads at every position, one for each strand (forward and reverse). This allows us to maintain strand information that is useful for downstream filtering.
  • Fixed bug where representative depths were arbitrarily being capped at 127 (instead of the expected 255).
  • Fixed bug where insertions downstream of a variant region weren't triggering a stop to the compression.
  • Fixed bug when using --cancer_mode where alignments were being emitted out of order (and causing the tool to fail).

Unified Genotyper

  • Added the --onlyEmitSamples argument that, when provided, instructs the caller to emit only the selected samples into the VCF (even though the calling is performed over all samples present in the provided bam files).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Added a (very) experimental argument (allSitePLs) that will have the caller emit PLs for all sites (including reference sites). Note that this does not give a fully accurate reference model because it models only SNPs. For proper handling of the reference model, please use the Haplotype Caller.

Haplotype Caller

  • Added a still somewhat experimental PCR indel error model to the Haplotype Caller. By default this modeling is turned on and is very useful for removing false positive indel calls associated with PCR slippage around short tandem repeats (esp. homopolymers). Users have the option (with the --pcr_indel_model argument) of turning it off or making it even more aggressive (at the expense of losing some true positives too).
  • Added the ability to emit accurate likelihoods for non-variant positions (i.e. what we call a "reference model" that incorporates indels as well as SNP confidences at every position). The output format can be either a record for every position or use the gVCF style recording of blocks. See the --emitRefConfidence argument for more details; note that this replaces the use of "--output_mode EMIT_ALL_SITES" in the HaplotypeCaller.
  • Improvements to the internal likelihoods that are generated by the Haplotype Caller. Specifically, this tool now uses a tri-state correction like the Unified Genotyper, corrects for overlapping read pairs (from the same underlying fragment), and does not run contamination removal (allele-biased downsampling) by default.
  • Several small runtime performance improvements were added (although we are still hard at work on larger improvements that will allow calling to scale to many samples; we're just not there yet).
  • Fixed bug in how adapter clipping was performed (we now clip only after reverting soft-clipped bases).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Improved the "dangling tail" recovery in the assembly algorithm, which allows for higher sensitivity in calling variants at the edges of coverage (e.g. near the ends of targets in an exome).
  • Added the ability to run allele-biased downsampling with different per-sample values like the Unified Genotyper (contributed by Yossi Farjoun).

Variant Annotator

  • Fixed bug where only the last -comp was being annotated at a site.

Indel Realigner

  • Fixed bug that arises because of secondary alignments and that was causing the tool not to update the alignment start of the mate when a read was realigned.

Phase By Transmission

  • Fixed bug where multi-allelic records were being completely dropped by this tool. Now they are emitted unphased.

Variant Recalibrator

  • General improvements to the Gaussian modeling, mostly centered around separating the parameters for the positive and negative training models.
  • The percentBadVariants argument has been replaced with the numBad argument.
  • Added mode to not emit (at all) variant records that are filtered out.
  • This tool now automatically orders the annotation dimensions by their standard deviation instead of the order they were specified on the command-line in order to stabilize the training and have it produce optimal results.
  • Fixed bug where the tool occasionally produced bad log10 values internally.

Miscellaneous

  • General performance improvements to the VCF reading code contributed by Michael McCowan.
  • Error messages are much less verbose and "scary."
  • Added a LibraryReadFilter contributed by Louis Bergelson.
  • Fixed the ReadBackedPileup class to represent mapping qualities as ints, not (signed) bytes.
  • Added the engine-wide ability to do on-the-fly BAM file sample renaming at runtime (see the documentation for the --sample_rename_mapping_file argument for more details).
  • Fixed bug in how the GATK counts filtered reads in the traversal output.
  • Added a new tool called Qualify Intervals.
  • Fixed major bug in the BCF encoding (the previous version was producing problematic files that were failing when trying to be read back into the GATK).
  • Picard/sam/tribble/variant jars updated to version 1.96.1534.

Created 2015-06-23 14:04:01 | Updated | Tags: variantrecalibrator
Comments (7)

When analyzing whole-exome sequencing data, at the VQSR step, should I use the whole-genome VCF file or the VCF file containing only exonic SNPs? Why do the two approaches show dramatically different results?


Created 2015-05-07 09:53:30 | Updated | Tags: variantrecalibrator
Comments (2)

If I'm calling variants on an organism (or population) without established common/known variants, is performing variant recalibration redundant? My provisional approach would be to use high-quality calls as a reference variant set, and then recalibrate based upon these, but I'm really not sure if this is correct.

I have read the documentation so apologies if it clearly answers my question already.
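
Purely for illustration, here is a minimal sketch of the bootstrapping idea described in the question above: hard-filter the raw calls to extract a high-confidence subset, then pass that subset back to VariantRecalibrator as a training/truth resource. All file names, the JEXL thresholds and the prior are hypothetical placeholders, not a validated recommendation.

    # Step 1 (hypothetical): select a high-confidence subset with a strict JEXL expression.
    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V raw_calls.vcf \
        -select "QD > 20.0 && MQ > 50.0 && FS < 10.0" \
        -o high_confidence.vcf

    # Step 2 (hypothetical): recalibrate the full call set using that subset as the resource.
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R reference.fasta \
        -input raw_calls.vcf \
        -resource:bootstrap,known=false,training=true,truth=true,prior=10.0 high_confidence.vcf \
        -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum \
        -mode SNP \
        -recalFile bootstrap_SNP.recal \
        -tranchesFile bootstrap_SNP.tranches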


Created 2015-04-22 16:10:11 | Updated | Tags: variantrecalibrator vcf-file bovine-nonhuman
Comments (1)

Hi, I have bovine next-generation sequencing results from the Illumina MiSeq platform. The data cover three specific genes. I need to do variant recalibration; first, what would I need for this analysis? In the example command line, all the data sets are of human origin; since I am working with a "nonhuman organism" I don't have every file on the list. What information from those files would be essential for the VariantRecalibration step?

Also, I don't have the VCF file, but one could be generated from the bovineHD SNP array; would the file generated from the bovineHD array manifest be suitable to use for next-generation sequencing variant recalibration?

Thank you
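
As an illustration of the command syntax only (not an endorsement of any particular resource), a VariantRecalibrator call with a single non-human, array-derived resource might look like the hypothetical sketch below. Every file name and the prior value are placeholders, and whether a call set covering only three genes contains enough variants to train the model reliably is a separate question.

    # Hypothetical bovine example; all paths and the prior are placeholders.
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R bosTau_reference.fasta \
        -input bovine_raw_calls.vcf \
        -resource:bovineHD,known=false,training=true,truth=true,prior=12.0 bovineHD_array_sites.vcf \
        -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum \
        -mode SNP \
        -recalFile bovine_SNP.recal \
        -tranchesFile bovine_SNP.tranches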


Created 2015-04-14 05:23:45 | Updated | Tags: variantrecalibrator indels
Comments (2)

Hi,

I am trying to run VQSR on my VCF file produced by HaplotypeCaller. It worked pretty well when dealing with SNPs, but errors popped up when recalibrating INDELs. I am running GATK version 3.2. The command lines are below:

/ifs2/BC_MD/DEV/WorkFlow/Sofware/jre1.7.0_45/bin/java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /ifs2/BC_MD/DEV/WorkFlow/Sofware/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T VariantRecalibrator -R /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/ucsc.hg19.fasta -input ./mem_with_a/HC_raw.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -an QD -an FS -an ReadPosRankSum -an MQRankSum -mode INDEL --maxGaussians 4 -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile ./mem_with_a/HC_indel.VQSR.recal -tranchesFile ./mem_with_a/HC_indel.VQSR.tranches

/ifs2/BC_MD/DEV/WorkFlow/Sofware/jre1.7.0_45/bin/java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /ifs2/BC_MD/DEV/WorkFlow/Sofware/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T ApplyRecalibration -R /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/ucsc.hg19.fasta -input ./mem_with_a/HC_raw.vcf -mode INDEL --ts_filter_level 99.0 -recalFile ./mem_with_a/HC_indel.VQSR.recal -tranchesFile ./mem_with_a/HC_indel.VQSR.tranches -o ./mem_with_a/HC_indel.vcf

Error messages are there:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138) at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

Thank you!

Emma


Created 2015-04-10 02:14:20 | Updated | Tags: variantrecalibrator combinevariants combinegvcfs
Comments (3)

Hi all, I have 2 DNA-seq batches of samples (more batches are still being sequenced, i.e. this is a cohort). I used HaplotypeCaller per sample and got 2 batches of gVCFs, one gVCF per sample. But I am not sure which of the following orders is best: (CombineGVCFs per batch) -> GenotypeGVCFs (obtain a single vcf) -> VariantRecalibration, or GenotypeGVCFs per batch -> VariantRecalibration per batch -> CombineVariants (obtain a single vcf). What is the difference between these two orders, and which one is more suitable for my situation? Hoping for any suggestions. Thank you very much! Best, Hiu
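
To make the first ordering concrete, a minimal sketch (with placeholder file names, and only two samples and two batches shown) could look like the following:

    # 1) Combine the per-sample gVCFs within each batch (hypothetical file names).
    java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R reference.fasta \
        -V batch1_sampleA.g.vcf -V batch1_sampleB.g.vcf \
        -o batch1.g.vcf

    # 2) Joint-genotype across all batches in a single call to obtain one cohort VCF.
    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta \
        -V batch1.g.vcf -V batch2.g.vcf \
        -o cohort.vcf

    # 3) Run VariantRecalibrator / ApplyRecalibration once on cohort.vcf.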


Created 2015-02-23 13:55:20 | Updated | Tags: variantrecalibrator tranches-plot
Comments (4)

Hi Geraldine, I think this question has been asked before, but I cannot find a way to fix the problem. I have just run the VariantRecalibrator tool, and I'm getting this (see attached file) tranches plot for SNPs. I understood the tranches plot in the Best Practices manual correctly, but I don't know what is going on here. I know that the tranches are sorted by Ti/Tv, but to me this plot makes no sense.

This is my command:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T VariantRecalibrator -R human_g1k_v37.fasta \
        -input all.joinGeno.raw.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
        -recalFile recalibrate_SNP.recal \
        -tranchesFile recalibrate_SNP.tranches \
        -rscriptFile recalibrate_SNP_plots.R

Any help would be appreciated.


Created 2015-02-18 05:48:33 | Updated 2015-02-18 05:49:23 | Tags: variantrecalibrator annotations variant-recalibration
Comments (8)

Hi, I am working on variant calling for a genotype of rice and I just generated a GVCF file. Here is a snippet of the GVCF file.

    Chr1 173923 . C T, 40.74 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:9:73,9,0,73,9,73:0,0,2,1
    Chr1 247702 . G A, 16.63 . DP=2;MLEAC=2,0;MLEAF=1.00,0.00;MQ=29.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:48,6,0,48,6,48:0,0,0,0
    Chr1 247703 . G . . END=247714 GT:DP:GQ:MIN_DP:PL 0/0:2:5:2:0,6,43
    .....
    Chr1 248043 . A C, 0 . DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQ=27.07;MQ0=0 GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,45,3,45,45:0,0,0,0

For the VQSR step, I only have the dbSNP file for rice (there are no hapmap or 1000G files), which has the following annotations:

    CHROM POS ID REF ALT QUAL FILTER INFO
    Chr1 84 rs349504122 T C . . RSPOS=84;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
    Chr1 2223 rs349642167 C T . . RSPOS=2223;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
    Chr1 2228 rs350936653 A C . . RSPOS=2228;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
    Chr1 2417 rs347865191 G A . . RSPOS=2417;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
    Chr1 2546 rs349956414 A C . . RSPOS=2546;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
    Chr1 2588 rs352913853 A C . . RSPOS=2588;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
    Chr1 2634 rs348515762 C A . . RSPOS=2634;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100

Now, if I use -an QD for VQSR, it won't work because QD is not present in the input VCF file. So, given the two files above, what annotations can I use for the VQSR step? Can I use the VQSR step at all?

And this is the setting with which I am using the dbSNP resource: -resource:dbSNP,known=true,training=true,truth=true,prior=6.0. Are the known/training/truth settings correct if dbSNP is the only resource?

Thanks in advance, you have been very helpful NB


Created 2015-02-17 15:58:44 | Updated | Tags: variantrecalibrator gatk
Comments (7)

I'm sorry I keep posting these questions, but this error has me baffled!

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:399) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:143) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183) ... 5 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unable to retrieve result
ERROR ------------------------------------------------------------------------------------------

Here is the input I use to get it:

    java -d64 -Xmx8g -jar $GATKDIR/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -input k_family_out.vcf \
        -R $REF \
        -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 $HAPMAP \
        -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 $OMNI \
        -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 $DBSNP \
        -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 $MILLS \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum \
        -mode BOTH \
        -U ALL \
        -recalFile k_family.recal \
        -tranchesFile k_family.tranches \
        -rscriptFile k_family.R \
        -nt 6 --TStranche 100.0 --TStranche 99.9 --TStranche 99.5 --TStranche 99.0 \
        --TStranche 98.0 --TStranche 97.0 --TStranche 96.0 --TStranche 95.0 --TStranche 94.0 \
        --TStranche 93.0 --TStranche 92.0 --TStranche 91.0 --TStranche 90.0

I thought this was similar to a multithreading error, so I reduced my pipeline to execute with only a single node, but the problem persists.

Any help will be appreciated.


Created 2015-02-17 03:22:15 | Updated | Tags: variantrecalibrator variant-recalibration
Comments (6)

Hi, I get the following error with VariantRecalibrator when I run it:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Os_nipponbare_v7.0_genome.fasta -input ESSR.vcf -mode SNP -recalFile ERS468243.raw.snps.recal -tranchesFile ERS468243.raw.snps.tranches -rscriptFile ERS468243.recal.plots.R -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 all_sorted.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ

ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.

What am I doing wrong? Am I using the wrong annotation? Thanks in advance for any help. Best, NB
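
For reference, the error message's suggestion to use VariantAnnotator would look roughly like the hypothetical sketch below; it assumes the original BAM is available, and the annotation names should be verified against the documentation for the version in use.

    # Hypothetical re-annotation run; the BAM file name is a placeholder.
    java -jar GenomeAnalysisTK.jar -T VariantAnnotator \
        -R Os_nipponbare_v7.0_genome.fasta \
        -I rice_sample.bam \
        -V ESSR.vcf \
        -A QualByDepth -A FisherStrand -A RMSMappingQuality \
        -A MappingQualityRankSumTest -A ReadPosRankSumTest \
        -o ESSR.annotated.vcf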


Created 2015-02-09 19:11:28 | Updated | Tags: variantrecalibrator
Comments (1)

I have a quick question re: your GATK Best Practice paper (Curr protoc bioinformatics). Forgive me, since I am quite new to SNP analysis; but I have tried to implement the pipeline using the main Galaxy server. I am more comfortable using Galaxy!

I had no problems until I reached the Variant Recalibrator stage. I can see from the error message that my Hapmap_3.3.hg19.vcf, the Omni one and the 1000g one (all downloaded from the GATK bundle, hg19) do not have the required annotations (DP, FS, QD, MQRankSum and ReadPosRankSum) to fulfill the instructions as per the paper.

I have read all the relevant GATK sections on your website as well, but I am unclear how to use these training sets, since they lack the required annotations. My understanding is that I would need the accompanying BAM files to fully annotate these files? I did attempt to use the Annotator function (without the BAM) on the HapMap VCF, but I got a message saying that I didn't have enough memory (I have not encountered such an error with Galaxy before, although I don't have that much Galaxy experience).

Can you suggest where I might find such annotated training sets? (hg19) Or, how to annotate the ones from the bundle?


Created 2015-02-04 05:14:44 | Updated | Tags: variantrecalibrator vqsr vcf gatk
Comments (5)

Hi,

I have generated vcf files using GenotypeGVCFs; each file contains variants corresponding to a different chromosome. I would like to use VQSR to perform the recalibration on all these data combined (for maximum power), but it seems that VQSR only takes a single vcf file, so I would have to combine my vcf files using CombineVariants. Looking at the documentation for CombineVariants, it seems that this tool always produces a union of vcfs. Since each vcf file is chromosome-specific, there are no identical sites across files. My questions are: Is CombineVariants indeed the appropriate tool for me to merge chromosome-specific vcf files, and is there any additional information that I should specify in the command-line when doing this? Do I need to run VariantAnnotator afterwards (I would assume not, since these vcfs were generated using GenotypeGVCFs and the best practices workflow more generally)? I just want to be completely sure that I am proceeding correctly.

Thank you very much in advance, Alva
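
For what it's worth, a minimal sketch of merging per-chromosome VCFs that share the same sample set might look like the following; the chromosome file names are placeholders, and the --assumeIdenticalSamples flag should be double-checked against the version in use.

    # Hypothetical merge of per-chromosome VCFs carrying identical sample sets.
    java -jar GenomeAnalysisTK.jar -T CombineVariants \
        -R reference.fasta \
        --variant chr1.vcf --variant chr2.vcf --variant chr3.vcf \
        --assumeIdenticalSamples \
        -o genome_wide.vcf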


Created 2015-01-28 23:52:23 | Updated | Tags: variantrecalibrator
Comments (2)

Hi All. For some reason I'm getting this error message when running GATK 3.3 VariantRecalibrator:

Output:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Argument '--resource' requires a value but none was provided
ERROR ------------------------------------------------------------------------------------------

Input:

    java -d64 -Xmx8g -jar $GATKDIR/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -input b37.vcf \
        -R $REF \
        -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 $HAPMAP \
        -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 $OMNI \
        -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 $DBSNP \
        -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 $MILLS \
        -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ \
        --mode BOTH \
        -recalFile b37.recal \
        -tranchesFile b37.tranches \
        -rscriptFile b37.R \
        -nt 6 --TStranche 100.0 --TStranche 99.9 --TStranche 99.5 --TStranche 99.0 \
        --TStranche 98.0 --TStranche 97.0 --TStranche 96.0 --TStranche 95.0 --TStranche 94.0 \
        --TStranche 93.0 --TStranche 92.0 --TStranche 91.0 --TStranche 90.0

I checked to see if the resource files exist, and they do. I cannot seem to figure out what the problem is.

Any help will be much appreciated


Created 2015-01-27 20:25:42 | Updated | Tags: variantrecalibrator vqsr vcf gatk
Comments (10)

Hi,

I ran VariantRecalibrator and ApplyRecalibration, and everything seems to have worked fine. I just have one question: if there are no reference alleles besides "N" in my recalibrate_SNP.recal and recalibrate_INDEL.recal files, and the "alt" field simply displays , does that mean that none of my variants were recalibrated? Just wanted to be completely sure. My original file (after running GenotypeGVCFs) has the same number of variants as the recalibrated VCFs.

Thanks, Alva


Created 2015-01-10 08:13:41 | Updated | Tags: unifiedgenotyper variantrecalibrator vqsr haplotypescore annotation
Comments (2)

The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired

I will include it in the VQSR model for UG calls from low-coverage data. Is this an unwise decision? I guess this is for me to evaluate. I thought I would ask, in case I have missed something obvious.
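
For concreteness, including the annotation in a UnifiedGenotyper-based recalibration is just a matter of adding it to the -an list, as in this hypothetical sketch (the resource and file names are placeholders):

    # Hypothetical SNP recalibration of UG calls that includes HaplotypeScore.
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R reference.fasta \
        -input ug_raw_calls.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_sites.vcf \
        -an QD -an MQ -an FS -an MQRankSum -an ReadPosRankSum -an HaplotypeScore \
        -mode SNP \
        -recalFile ug_SNP.recal \
        -tranchesFile ug_SNP.tranches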


Created 2015-01-01 09:08:59 | Updated | Tags: variantrecalibrator
Comments (2)

Hi, I'm running VariantRecalibrator on an input VCF file that was created by GenotypeGVCFs from the HaplotypeCaller gVCFs of 66 samples. I'm getting an error message: "MESSAGE: Line 474547: there aren't enough columns for line".

All input VCF lines contain the same number of columns. I tried the same command several times and got the same error message with a different line number each time. Also, the same command used with another input file (with fewer samples, 57) runs fine. Please advise: are there parameters I should change to handle a larger input file?

Thanks, Lily


Created 2014-12-23 01:56:20 | Updated 2014-12-23 01:57:55 | Tags: variantrecalibrator queue
Comments (1)

Hello GATK team, When I use the VariantRecalibrator, I need to set the resource as

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf

When I use VariantRecalibrator with Queue, I don't know how to set this in the Scala script. In Queue we pass each parameter as a (key value) pair, but how can it be expressed as (key:value value)?


Created 2014-11-14 12:54:42 | Updated 2014-11-14 13:10:24 | Tags: variantrecalibrator
Comments (3)

While I was running GATK to recalibrate the variants, the error below occurred. I have found some similar posts in the forum, but was unable to find an answer. The log is attached below.


Created 2014-10-27 16:52:19 | Updated | Tags: variantrecalibrator
Comments (5)

Hi, I'm trying to run the GATK recalibration pipeline on my data:

    java -jar /opt/bin/GenomeAnalysisTK-3.2-0/GenomeAnalysisTK.jar -T VariantRecalibrator -R ../reference/chrall.fa -input output.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ../hapmap_3.3.hg18.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 ../1000G_omni2.5.hg19.sites.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 ../1000G_phase1.indels.hg19.sites.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../dbsnp_138.hg19.vcf \
        -an DP \
        -an QD \
        -an FS \
        -an MQRankSum \
        -an ReadPosRankSum -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
        -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R

But I'm getting the following error:

##### ERROR

ERROR MESSAGE: Bad input: Found annotations with zero variance. They must be excluded before proceeding.
ERROR ------------------------------------------------------------------------------------------

Can anyone tell me what I'm doing wrong? Thanks,

Rômulo


Created 2014-10-08 15:30:10 | Updated | Tags: variantrecalibrator dbsnp variant-recalibration rice
Comments (7)

Hi, I am running GATK on rice libraries and I get the following error with VariantRecalibrator. The command I use is:

java -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R genome.fasta -input chr5.vcf -mode SNP -recalFile chr5.raw.snps.recal -tranchesFile chr5.raw.snps.tranches -rscriptFile chr5.recal.plots.R -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 all_chromosomes.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ

This is how the program terminates:

INFO 21:42:35,099 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 21:42:35,102 TrainingSet - Found dbSNP track: Known = true Training = true Truth = true Prior = Q6.0

INFO 21:43:05,104 ProgressMeter - Chr11:11654383 9.83e+06 30.0 s 3.0 s 87.7% 34.0 s 4.0 s INFO 21:43:07,413 VariantDataManager - QD: mean = 27.02 standard deviation = 7.70 INFO 21:43:07,440 VariantDataManager - MQRankSum: mean = -0.62 standard deviation = 3.35 INFO 21:43:07,479 VariantDataManager - ReadPosRankSum: mean = 0.25 standard deviation = 1.63 INFO 21:43:07,510 VariantDataManager - FS: mean = 2.11 standard deviation = 9.67 INFO 21:43:07,525 VariantDataManager - MQ: mean = 47.09 standard deviation = 11.34 INFO 21:43:07,660 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, MQRankSum, ReadPosRankSum] INFO 21:43:07,680 VariantDataManager - Training with 7845 variants after standard deviation thresholding.

INFO 21:43:09,833 VariantRecalibratorEngine - Convergence after 93 iterations! INFO 21:43:09,872 VariantRecalibratorEngine - Evaluating full set of 272372 variants... INFO 21:43:09,884 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. INFO 21:43:10,854 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I have read about similar problems on the GATK forums and, based on those, it seems to me that the training set of VCFs is too small for my data. Is that so? If so, can you please tell me how I can fix it? This is for rice, and I only have one set of known VCFs from dbSNP.

Can you also please confirm if my settings for 'known', 'training' and 'truth' are correct.

Thanks a lot in advance.


Created 2014-09-29 11:03:13 | Updated | Tags: variantrecalibrator vqsr
Comments (2)

Hi GATK team,

Our lab has a never-ending discussion about whether to run VQSR on related samples or to exclude them, and I guess we need your help to settle this.

We have a multisample call (UG) run on ~1,500 samples, which contains all sorts of unrelated samples, trios and small families. Our statistician is trying to convince us to exclude all related samples, because this might skew the VQSR model. The biologists don't follow this argument, but we are unable to convince each other. Do related samples disturb the VQSR?

Even more specifically: if we run VQSR on tumor/normal pairs, should we expect surprising behaviour from the model, or can we just run the recalibration without worries?

thanks for your help in advance, Oliver


Created 2014-09-19 23:06:10 | Updated | Tags: variantrecalibrator applyrecalibration gatk
Comments (2)

Hi! I am attempting to go through the process of applying a recalibration to an input VCF. I tried following the steps outlined in the tutorial "Howto: Recalibrate Variant Quality Scores".

Step 2 runs great; I used all the tranches and resources that the howto mentions. Unfortunately, when I attempt to apply the recalibration (step 3), I receive the following error:

Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 1:14907 Q523.77 of type=SNP alleles=[A*, G] attr={AC=1, AF=0.500, AN=2, BaseQRankSum=-1.863, ClippingRankSum=-0.420, DP=86, FS=0.984, MLEAC=1, MLEAF=0.500, MQ=36.93, MQ0=0, MQRankSum=0.338, QD=6.09, ReadPosRankSum=2.064} GT=GT:AD:DP:GQ:PL 0/1:57,29:86:99:552,0,1482

When I check the input file I notice that this is the first variant call in my VCF. I tried this process with other VCF files and each time my ApplyRecalibration gets stuck on the first variant call.

Here is the command I used to generate the recalibration:

 java -Djava.io. -Xmx4g -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
                -T VariantRecalibrator -R /human_g1k_v37.fasta \
                -L file.intervals \
                -input input_file.vcf \

                -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
                -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
                -resource:1000G,known=false,training=true,truth=false,prior=10.0 Mills_and_1000G_gold_standard.indels.b37.vcf \
                -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \

                -an DP \
                -an QD \
                -an FS \
                -an MQRankSum \
                -an ReadPosRankSum \

                -mode SNP \
                -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
                -recalFile recal_file.recal \
                -tranchesFile tranches_file.tranches \
                -rscriptFile rscipt_file.plots.R\

And here is my command for applying the recalibration:

java -Xmx4g -jar GenomeAnalysisTK.jar
            -T ApplyRecalibration
            -R human_g1k_v37.fasta
            -input input_file.vcf
            -mode SNP
            --ts_filter_level 99.0
            -recalFile  recal_file.recal
            -tranchesFile tranches_files.tranches
            -o recalibrated_output.vcf 

Any suggestions to help me troubleshoot the issue are very welcome. I should also mention that I am using GATK 3.2.2.

Thank you in advance!
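
One detail that stands out in the two commands above: the VariantRecalibrator run was restricted with -L file.intervals, while the ApplyRecalibration run was not. Purely as a hedged sketch (not a confirmed diagnosis), an ApplyRecalibration call restricted to the same intervals would look like this:

    # Hypothetical: apply the recalibration over the same intervals used to build it.
    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R human_g1k_v37.fasta \
        -L file.intervals \
        -input input_file.vcf \
        -mode SNP \
        --ts_filter_level 99.0 \
        -recalFile recal_file.recal \
        -tranchesFile tranches_files.tranches \
        -o recalibrated_output.vcf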


Created 2014-09-19 11:13:28 | Updated 2014-09-19 11:17:23 | Tags: variantrecalibrator bug
Comments (6)

As advised by the GATK program itself, I am posting this potential bug in the forum:

This is the command I used to call the program from my Perl pipeline:

java -Xmx4g -Djava.io.tmpdir=/tmp -jar /opt/gatk/gatk3.2-3/GenomeAnalysisTK.jar 
-T VariantRecalibrator 
-R /storage/Genomes/GATK/hg19.fasta -input "$path"/"$name".recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 
/storage/Genomes/GATK/Mills_and_1000G_gold_standard.indels.hg19.vcf 
-an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL 
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 
-recalFile "$path"/"$name".recalibrate_INDEL.recal 
-tranchesFile "$path"/"$name".recalibrate_INDEL.tranches 
-rscriptFile "$path"/"$name".recalibrate_INDEL_plots.R 

"INFO 13:04:24,773 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:04:24,776 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 INFO 13:04:24,776 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:04:24,777 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:04:24,783 HelpFormatter - Program Args: -T VariantRecalibrator -R /storage/Genomes/GATK/hg19.fasta -input /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /storage/Genomes/GATK/Mills_and_1000G_gold_standard.indels.hg19.vcf -an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL.recal -tranchesFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL.tranches -rscriptFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL_plots.R INFO 13:04:24,786 HelpFormatter - Executing as jllavin@nextera on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_21-b11. INFO 13:04:24,787 HelpFormatter - Date/Time: 2014/09/19 13:04:24 INFO 13:04:24,787 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:04:24,787 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:04:29,468 GenomeAnalysisEngine - Strictness is SILENT INFO 13:04:30,018 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 13:04:30,244 GenomeAnalysisEngine - Preparing for traversal INFO 13:04:30,271 GenomeAnalysisEngine - Done preparing for traversal INFO 13:04:30,271 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:04:30,272 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 13:04:30,272 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 13:04:30,279 TrainingSet - Found mills track: Known = true Training = true Truth = true Prior = Q12.0 INFO 13:05:00,823 ProgressMeter - chr5:94965875 1168956.0 30.0 s 26.0 s 31.1% 96.0 s 66.0 s INFO 13:05:24,266 VariantDataManager - DP: mean = 16.62 standard deviation = 10.59 INFO 13:05:24,268 VariantDataManager - FS: mean = 1.14 standard deviation = 2.63 INFO 13:05:24,270 VariantDataManager - MQRankSum: mean = -0.16 standard deviation = 1.11 INFO 13:05:24,273 VariantDataManager - ReadPosRankSum: mean = -0.04 standard deviation = 0.99 INFO 13:05:24,300 VariantDataManager - Annotations are now ordered by their information content: [DP, FS, MQRankSum, ReadPosRankSum] INFO 13:05:24,301 VariantDataManager - Training with 1829 variants after standard deviation thresholding. INFO 13:05:24,305 GaussianMixtureModel - Initializing model with 100 k-means iterations... INFO 13:05:24,475 VariantRecalibratorEngine - Finished iteration 0. INFO 13:05:24,535 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.20003 INFO 13:05:24,564 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 4.20358 INFO 13:05:24,582 VariantRecalibratorEngine - Finished iteration 15. 
Current change in mixture coefficients = 0.00341 INFO 13:05:24,593 VariantRecalibratorEngine - Convergence after 18 iterations! INFO 13:05:24,611 VariantRecalibratorEngine - Evaluating full set of 3202 variants... INFO 13:05:24,612 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. INFO 13:05:26,140 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138) at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

"

I hope it can be easily fixed, since I desperately need my pipeline working for my current analysis ;)


Created 2014-09-12 19:05:19 | Updated 2014-09-12 19:07:12 | Tags: variantrecalibrator error retrieve result
Comments (2)

My command lines are as follows:

 java -Xmx8g -jar $CLASSPATH/GenomeAnalysisTK.jar \
 -T VariantRecalibrator \
 -R $GenomeReference \
 -input $InputVCF \
 -nt 6 \
 -resource:mills,known=true,training=true,truth=true,prior=12.0 $resource1 \
 -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $resource2 \
 -an DP -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
 --maxGaussians 8 \
 -mode INDEL \
 -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
 -log $IndelsOutput/Indels.log \
 -recalFile $IndelsOutput/exome.indels.vcf.recal \
 -tranchesFile $IndelsOutput/exome.indels.tranches \
 -rscriptFile $IndelsOutput/exome.indels.recal.plots.R

But I got the following error when running this:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result
    at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Caused by: java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
    at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
    at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
    ... 5 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):

So I suspect this is a bug in GATK 3.2.2?


Created 2014-08-20 08:33:12 | Updated | Tags: variantrecalibrator indel-recalibrator
Comments (3)

Hi, I'm using GATK 3.2 to run VariantRecalibrator in INDEL mode.

=========================== here is my command====================================

java -jar /usr/local/bin/GenomeAnalysisTK.jar -R /mnt/md1200/1/public_database/gatk_hg19/ucsc.hg19.fasta -T VariantRecalibrator -input 8_v1.gatk.recal.SNP.vcf -resource:1000G,known=true,training=true,truth=true,prior=12.0 /mnt/md1200/1/public_database/gatk_hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf -an MQ -an FS -mode INDEL --maxGaussians 4 -recalFile 8_v1.gatk.SNP.INDEL.recal -tranchesFile 8_v1.gatk.SNP.INDEL.tranches -rscriptFile 8_v1.gatk.SNP.INDEL.plots.R

I ran it and got the error below:

INFO 16:20:29,213 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 16:20:29,214 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 16:20:29,215 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 16:20:29,225 TrainingSet - Found 1000G track: Known = true Training = true Truth = true Prior = Q12.0 INFO 16:20:59,225 ProgressMeter - chr18:33989607 3080007.0 30.0 s 9.0 s 83.4% 35.0 s 5.0 s INFO 16:21:03,085 VariantDataManager - MQ: mean = 38.89 standard deviation = 6.63 INFO 16:21:03,086 VariantDataManager - FS: mean = 0.00 standard deviation = 0.01 INFO 16:21:03,095 VariantDataManager - Annotations are now ordered by their information content: [MQ, FS] INFO 16:21:03,095 VariantDataManager - Training with 123 variants after standard deviation thresholding. WARN 16:21:03,096 VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable. INFO 16:21:03,101 GaussianMixtureModel - Initializing model with 100 k-means iterations... INFO 16:21:03,272 VariantRecalibratorEngine - Finished iteration 0. INFO 16:21:03,304 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 2.03385 INFO 16:21:03,322 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.01026 INFO 16:21:03,337 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.00960 INFO 16:21:03,350 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.00914 INFO 16:21:03,363 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.00889 INFO 16:21:03,376 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.00887 INFO 16:21:03,389 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.00916 INFO 16:21:03,401 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00991 INFO 16:21:03,407 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.01158 INFO 16:21:03,411 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.01606 INFO 16:21:03,416 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.04015 INFO 16:21:03,420 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.08508 INFO 16:21:03,424 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.05816 INFO 16:21:03,427 VariantRecalibratorEngine - Convergence after 69 iterations! INFO 16:21:03,429 VariantRecalibratorEngine - Evaluating full set of 291 variants... INFO 16:21:03,430 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. INFO 16:21:08,746 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138) at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.

##### ERROR ------------------------------------------------------------------------------------------

I don't know what's wrong with my data or command. Has anyone met the same error? Looking forward to your help. Thanks!


Created 2014-07-30 16:12:53 | Updated | Tags: variantrecalibrator indel-call number-of-variants-used-to-train-the-negative-mode
Comments (5)

I'm having an issue with VariantRecalibrator which I've faced in the past.

This is the Error which I've seen before: ##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).

And I understand it's because the total number of variants is too low, but I'm unable to see how that could be an issue for my data set.

wc *_genotyped.vcf = 350678 93598833 1851184230, which contains 200 background individuals + 3 target exomes.

And my command: java -jar -T VariantRecalibrator -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads 30 -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf -an MQRankSum -an ReadPosRankSum -an FS -input /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_genotyped.vcf -recalFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_recal -tranchesFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_tranches -rscriptFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_plots.R -mode INDEL

And my version: The Genome Analysis Toolkit (GATK) vnightly-2014-06-17-gc231c21, Compiled 2014/06/17 00:01:17

So you see my minNumBadVariants is already set to 5000.

Is the error due to the total number of indels in the VCF file? Otherwise the "too few variants" error seems wrong.


Created 2014-07-21 13:21:19 | Updated | Tags: variantrecalibrator
Comments (16)

I'm trying to run variant recalibration for SNPs as part of the GATK Best Practices workflow for DNA sequence analysis, but am getting the following error, presumably because after training there are 0 variants with an LOD score <= -5.0.

The error is:

ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183) ... 5 more

I am using GATK 3.2-2. The full output from this stage is attached.

I also got a similar error with GATK 3.1. I have run it a few times now, and although there are very slight variations in the numbers it outputs, the result is consistently 0 variants with a LOD <= -5.0.

I have run this stage successfully when I process my reads in two separate chunks, but when I combine them all into one processing pipeline, they fail at this stage.

Any advice on how to analyse this and determine what I need to change, or where I am going wrong, would be welcome.

Regards,

Ally


Created 2014-06-03 14:57:59 | Updated | Tags: variantrecalibrator training
Comments (13)

Hi,

I've recently updated my GATK pipeline for population-level variant calling of dog WGS data to take advantage of the new GenotypeGVCFs module instead of relying on UnifiedGenotyper or HaplotypeCaller to handle dozens of samples simultaneously. Unfortunately, now when I try to recalibrate my variants using the truth set of ~150K SNPs I was using before, the training eventually converges by excluding all the training variants, leading to a crash. At first I was running the command multithreaded, but this also happens when it's run as a single thread (see stack trace below). Any ideas what's going on?

Thanks, Adam

INFO 16:40:52,233 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:40:52,236 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 16:40:52,236 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:40:52,237 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:40:52,241 HelpFormatter - Program Args: -T VariantRecalibrator -R canFam3.1.fa -input raw.debug.vcf -resource:illumina,known=false,training=true,truth=true,prior=12.0 sites.cf3.1.vcf -an DP -an QD -an MQRankSum -an FS -an ReadPosRankSum -an MQ -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 -recalFile snp.debug.recal -tranchesFile snp.debug.tranches -rscriptFile snp.debug.plots.R
INFO 16:40:52,740 HelpFormatter - Executing on Linux 2.6.32-358.14.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO 16:40:52,740 HelpFormatter - Date/Time: 2014/06/02 16:40:52
INFO 16:40:52,741 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:40:52,741 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:40:53,837 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:40:54,288 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 16:40:55,647 GenomeAnalysisEngine - Preparing for traversal
INFO 16:40:55,686 GenomeAnalysisEngine - Done preparing for traversal
INFO 16:40:55,704 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:40:55,722 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 16:40:56,264 TrainingSet - Found illumina track: Known = false Training = true Truth = true Prior = Q12.0
INFO 16:41:26,324 ProgressMeter - chr1:119694011 1.33e+06 30.0 s 23.0 s 5.0% 10.1 m 9.6 m
INFO 16:41:57,528 ProgressMeter - chr10:24041205 1.65e+06 61.0 s 37.0 s 6.1% 16.7 m 15.7 m
INFO 16:42:37,192 ProgressMeter - chr12:3177469 3.13e+06 101.0 s 32.0 s 11.2% 15.1 m 13.4 m
INFO 16:43:09,795 ProgressMeter - chr14:46556735 5.37e+06 2.2 m 24.0 s 18.6% 12.0 m 9.8 m
INFO 16:43:42,333 ProgressMeter - chr17:46123438 7.70e+06 2.8 m 21.0 s 26.2% 10.5 m 7.8 m
INFO 16:44:20,667 ProgressMeter - chr2:56532172 1.00e+07 3.4 m 20.0 s 33.9% 10.0 m 6.6 m
INFO 16:44:51,269 ProgressMeter - chr22:26635764 1.21e+07 3.9 m 19.0 s 40.7% 9.6 m 5.7 m
INFO 16:45:22,151 ProgressMeter - chr25:8747205 1.38e+07 4.4 m 19.0 s 46.6% 9.5 m 5.1 m
INFO 16:45:52,154 ProgressMeter - chr27:41179493 1.56e+07 4.9 m 18.0 s 51.7% 9.5 m 4.6 m
INFO 16:46:22,532 ProgressMeter - chr3:13905602 1.69e+07 5.4 m 19.0 s 55.9% 9.7 m 4.3 m
INFO 16:46:52,534 ProgressMeter - chr3:77994124 1.76e+07 5.9 m 20.0 s 58.6% 10.1 m 4.2 m
INFO 16:47:22,536 ProgressMeter - chr30:9518519 1.80e+07 6.4 m 21.0 s 59.6% 10.8 m 4.4 m
INFO 16:47:52,538 ProgressMeter - chr30:36929683 1.83e+07 6.9 m 22.0 s 60.7% 11.4 m 4.5 m
INFO 16:48:24,345 ProgressMeter - chr31:13298903 1.86e+07 7.5 m 24.0 s 61.4% 12.2 m 4.7 m
INFO 16:48:54,346 ProgressMeter - chr31:38887091 1.89e+07 8.0 m 25.0 s 62.4% 12.8 m 4.8 m
INFO 16:49:24,348 ProgressMeter - chr32:16826952 1.92e+07 8.5 m 26.0 s 63.2% 13.4 m 4.9 m
INFO 16:49:54,350 ProgressMeter - chr33:9665516 1.97e+07 9.0 m 27.0 s 64.5% 13.9 m 4.9 m
INFO 16:50:24,352 ProgressMeter - chr34:8616301 2.01e+07 9.5 m 28.0 s 65.8% 14.4 m 4.9 m
INFO 16:50:54,354 ProgressMeter - chr34:41058797 2.06e+07 10.0 m 29.0 s 67.1% 14.9 m 4.9 m
INFO 16:51:26,984 ProgressMeter - chr35:20891681 2.09e+07 10.5 m 30.0 s 68.0% 15.5 m 4.9 m
INFO 16:51:58,253 ProgressMeter - chr38:14260224 2.20e+07 11.0 m 30.0 s 71.4% 15.5 m 4.4 m
INFO 16:52:32,120 ProgressMeter - chr4:78859835 2.31e+07 11.6 m 30.0 s 75.1% 15.5 m 3.9 m
INFO 16:53:02,719 ProgressMeter - chr6:24630819 2.46e+07 12.1 m 29.0 s 80.2% 15.1 m 3.0 m
INFO 16:53:32,858 ProgressMeter - chr7:70954178 2.60e+07 12.6 m 29.0 s 85.3% 14.8 m 2.2 m
INFO 16:54:11,822 ProgressMeter - chr9:39193117 2.76e+07 13.3 m 28.0 s 90.4% 14.7 m 84.0 s
INFO 16:54:43,302 ProgressMeter - chrUn_JH373262:194062 2.91e+07 13.8 m 28.0 s 97.3% 14.2 m 22.0 s
INFO 16:55:17,832 ProgressMeter - chrUn_AAEX03023557:310 2.98e+07 14.4 m 28.0 s 99.2% 14.5 m 6.0 s
INFO 16:55:23,341 VariantDataManager - DP: mean = 581.76 standard deviation = 110.45
INFO 16:55:26,420 VariantDataManager - QD: mean = 19.65 standard deviation = 5.17
INFO 16:55:29,554 VariantDataManager - MQRankSum: mean = 0.15 standard deviation = 0.38
INFO 16:55:32,879 VariantDataManager - FS: mean = 3.29 standard deviation = 3.73
INFO 16:55:36,112 VariantDataManager - ReadPosRankSum: mean = 0.25 standard deviation = 0.38
INFO 16:55:39,447 VariantDataManager - MQ: mean = 59.98 standard deviation = 0.43
INFO 16:56:34,985 VariantDataManager - Annotations are now ordered by their information content: [DP, MQ, QD, FS, MQRankSum, ReadPosRankSum]
INFO 16:56:35,732 VariantDataManager - Training with 157123 variants after standard deviation thresholding.
INFO 16:56:35,740 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 16:57:20,532 VariantRecalibratorEngine - Finished iteration 0.
INFO 16:57:32,908 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 1.28438
INFO 16:57:45,208 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.74161
INFO 16:57:57,279 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 7.59772
INFO 16:58:09,478 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.28910
INFO 16:58:22,278 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.05857
INFO 16:58:34,804 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.05605
INFO 16:58:47,116 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.05180
INFO 16:58:59,115 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.04196
INFO 16:59:11,329 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.03361
INFO 16:59:23,682 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.02521
INFO 16:59:38,689 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.02037
INFO 16:59:51,346 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.01612
INFO 17:00:03,973 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.01269
INFO 17:00:16,765 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.00867
INFO 17:00:29,690 VariantRecalibratorEngine - Finished iteration 75. Current change in mixture coefficients = 0.00473
INFO 17:00:42,394 VariantRecalibratorEngine - Finished iteration 80. Current change in mixture coefficients = 0.00248
INFO 17:00:47,807 VariantRecalibratorEngine - Convergence after 82 iterations!
INFO 17:00:50,993 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 17:01:12,880 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------
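
A note on this log: the model converges, but the negative model ends up with zero variants ("Training with worst 0 scoring variants"), and it is that empty negative training set which triggers the "No data found" exception. A retry that feeds the negative model more sites might look like the sketch below; --minNumBadVariants 5000 and --maxGaussians 4 are illustrative values borrowed from other threads in this collection, not a guaranteed fix, and the file names are taken from the Program Args shown above.

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R canFam3.1.fa -input raw.debug.vcf \
    -resource:illumina,known=false,training=true,truth=true,prior=12.0 sites.cf3.1.vcf \
    -an DP -an QD -an MQRankSum -an FS -an ReadPosRankSum -an MQ \
    -mode SNP --minNumBadVariants 5000 --maxGaussians 4 \
    -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
    -recalFile snp.debug.recal -tranchesFile snp.debug.tranches -rscriptFile snp.debug.plots.R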

Created 2014-05-19 15:14:24 | Updated 2014-05-19 15:15:51 | Tags: variantrecalibrator gatk3
Comments (4)

Hello all,

Quick issue. I'm currently running VariantRecalibrator using the following command.

Program Args: -T VariantRecalibrator -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads 10 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf -an QD -an:HaplotypeScore -an:MQRankSum -an:ReadPosRankSum -an:FS -input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf -recalFile /data2/srynearson/UGP06/fastq/UGP06_snp_recal -tranchesFile /data2/srynearson/UGP06/fastq/UGP06_snp_tranches -rscriptFile /data2/srynearson/UGP06/fastq/UGP06_snp_plots.R -mode SNP

And getting the following error:

ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).

This seems strange given the "-input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf" file above has 100 CEU individuals + one target exome.

I will note that the genotyped VCF file only contains ~280,000 variants.

I have tried increasing the minNumBadVariants and reducing maxGaussian, but the error either stays the same or changes to "Unable to retrieve result".

Any ideas?

--SR
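
One thing that stands out in the pasted command is the colon-joined annotation flags (-an:HaplotypeScore, -an:MQRankSum, and so on), whereas every other command in this collection passes the annotation name after a space (-an QD -an MQRankSum ...). Cleaning that up will not necessarily cure the NaN LOD message, which itself points at raising --minNumBadVariants, but a conventionally formed version of the same command, sketched with the poster's own paths and resources, would be:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta \
    -input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf \
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS \
    --minNumBadVariants 5000 --num_threads 10 -mode SNP \
    -recalFile /data2/srynearson/UGP06/fastq/UGP06_snp_recal \
    -tranchesFile /data2/srynearson/UGP06/fastq/UGP06_snp_tranches \
    -rscriptFile /data2/srynearson/UGP06/fastq/UGP06_snp_plots.R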


Created 2014-04-30 17:38:38 | Updated | Tags: variantrecalibrator bug indels no-data-found
Comments (5)

This crops up when running with "-mode INDEL". Not sure why there is no data. (See attached log file with stack trace.)

All input files are non-empty (except the .R file). A similar execution using "-mode SNP" completes with no problems. Since I'm simply looking to get the scripting and flags correct, I've used a public data set. Could it be that I'm unlucky and chose something that has no indels from the reference, which is causing the error? Could there be a more graceful method of termination?
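
If the worry is that the chosen public data set simply has no indels in it, that is quick to check before recalibrating: pull out the indel records and count them. A sketch using GATK's SelectVariants (reference and VCF names are placeholders):

java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fa \
    -V input.vcf -selectType INDEL -o input.indels.vcf
grep -vc "^#" input.indels.vcf    # number of indel records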


Created 2014-04-15 21:37:57 | Updated | Tags: variantrecalibrator vqsr
Comments (1)

Hi,

Sorry to bother you guys. Just a few quick questions:

1) I'm attempting to download the bundles for VQSR and I noticed that they are for b37 or hg19. If I performed my initial assemblies and later SNP calls with hg38, will this cause an issue? Should I restart the process using either b37 or hg19?

2) I'm still a bit lost on what is considered "too few variants" for VQSR. As VQSR works best when there are thousands of variants - is this recommendation on a per sample basis or for an entire project? I'm presently working with sequences from 80 unique samples for a single gene (~100kbp) and HaplotypeCaller detects on average ~300 raw snps. Would you recommend I hard filter instead in my case?

Thanks,

Dave
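
On question 2: for a target this small (one gene, ~300 raw SNPs), hard filtering is the usual fallback when VQSR has too little data to model. A minimal sketch using VariantFiltration with the generic SNP cutoffs that circulated around this time; treat the thresholds as starting points to tune, not project-specific recommendations, and the file names as placeholders:

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R reference.fa -V raw_snps.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "snp_hard_filter" \
    -o filtered_snps.vcf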


Created 2014-04-01 10:45:41 | Updated 2014-04-01 16:35:15 | Tags: variantrecalibrator
Comments (2)

Hi, I am trying to set up a GATK pipeline with version 3.1. Everything has gone well so far, and the SNP recalibration also went well. The problem is with applying recalibration to indels. Here are the commands I used (the first finished successfully, but not the second):

java -jar /home/lvzvia/GATK-3.0/GenomeAnalysisTK.jar -T VariantRecalibrator -R /ngs001/db/gatk_resources/bundle/2.5/b37/human_g1k_v37.fasta -input Combine1.vcf --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 /ngs001/db/gatk_resources/bundle/2.5/b37/Mills_and_1000G_gold_standard.indels.b37.vcf -an DP -an FS -an ReadPosRankSum -an MQRankSum -mode INDEL -recalFile raw.INDELs.recal -tranchesFile raw.INDELs.tranches -rscriptFile recal.INDEL.plots.R

java -Xmx3g -jar /home/lvzvia/GATK-3.0/GenomeAnalysisTK.jar -T ApplyRecalibration -R /ngs001/db/gatk_resources/bundle/2.5/b37/human_g1k_v37.fasta -input Combine1.vcf -tranchesFile raw.INDELs.tranches -recalFile raw.INDELs.recal -o recal.INDELs.vcf -ts_filter_level 99.9

The error message:

##### ERROR MESSAGE: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 1:14192 Q86.53 of type=SNP alleles=[G*, A] attr={AC=1, AF=0.019, AN=52, BaseQRankSum=-7.310e-01, ClippingRankSum=0.731, DP=401, FS=0.000, InbreedingCoeff=-0.0296, MLEAC=1, MLEAF=0.019, MQ=35.30, MQ0=0, MQRankSum=0.731, QD=17.31, ReadPosRankSum=-7.310e-01} GT=GT:AD:DP:GQ:PL 0/0:.:2:6:0,6,50 0/0:.:2:6:0,6,71 ./.:.:0 0/0:.:2:6:0,6,72

The input VCF contained both indels and SNPs, and the process went well for the SNPs.
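
One detail worth checking: the VariantRecalibrator command was run with -mode INDEL, but the ApplyRecalibration command passes no -mode argument, so it will also try to look up SNP records (such as the SNP at 1:14192 named in the error) in an indel-only recal file. That is a plausible, though not confirmed, explanation for the mismatch. A sketch of the matching ApplyRecalibration call, reusing the paths above:

java -Xmx3g -jar /home/lvzvia/GATK-3.0/GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R /ngs001/db/gatk_resources/bundle/2.5/b37/human_g1k_v37.fasta \
    -input Combine1.vcf -mode INDEL \
    -tranchesFile raw.INDELs.tranches -recalFile raw.INDELs.recal \
    -ts_filter_level 99.9 -o recal.INDELs.vcf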


Created 2014-03-27 21:34:25 | Updated | Tags: variantrecalibrator error
Comments (2)

I am analyzing ION PGM data

I'm getting 25 variants in one individual with the HaplotypeCaller. When I try to run the VariantRecalibrator, I get this error:

WARN 18:26:10,578 VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.
INFO 18:26:10,580 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 18:26:10,636 VariantRecalibratorEngine - Finished iteration 0.
INFO 18:26:10,639 VariantRecalibratorEngine - Convergence after 3 iterations!
INFO 18:26:10,640 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 18:26:11,854 ProgressMeter - chrM:14784 8.87e+06 30.0 s 3.0 s 100.0% 30.0 s 0.0 s
INFO 18:26:12,444 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

The command line I used was:

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T VariantRecalibrator -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -input 1_raw_variants.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/horus/Escritorio/PGM/primirna/references/hapmap_3.3.b37.vcf_nuevo -resource:omni,known=false,training=true,truth=true,prior=12.0 /home/horus/Escritorio/PGM/primirna/references/1000G_omni2.5.b37.vcf_nuevo -resource:1000G,known=false,training=true,truth=false,prior=10.0 /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/horus/Escritorio/PGM/primirna/references/dbsnp_138.hg19.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile 1_recalibrate_SNP.recal -tranchesFile 1_recalibrate_SNP.tranches


Created 2014-03-24 23:49:59 | Updated 2014-03-24 23:52:11 | Tags: variantrecalibrator applyrecalibration variant-recalibration tranches-plot
Comments (20)

Hi, I am planning to filter with ApplyRecal very stringently (I don't mind losing a lot of SNPs to reduce false positives), but I am having trouble viewing the tranches plot produced by VariantRecal to choose my cutoff level. When I run the following command I get the default visualisation of 90, 99, 99.9 & 100 with cumulative TPs & FPs, and everything looks fine:

java -Xmx8g -jar $gatk \
    -T VariantRecalibrator \
    -R $ref \
    -input $vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 $onek \
    -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 $dbsnp \
    --minNumBadVariants 1000 \
    -an QD \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an MQ \
    -an FS \
    -an DP \
    -mode SNP \
    -recalFile ${haplocall_date}_all_sites.snp.recal \
    -tranchesFile ${haplocall_date}_all_sites.snp.tranches \
    --rscript_file ${haplocall_date}_all_sites.snp.rscript \
    1>snp_recal_all_sites.out 2>snp_recal_all_sites.err &

But when I add in the following flags after '-mode SNP' to change the tranches, the plot displays them out of order (and the Ti/Tv ratio doesn't seem to correlate as I expected):

-tranche 80.0 -tranche 85.0 -tranche 90.0 -tranche 99.0 \

This means the cumulative FP/TP display is no longer appropriate and is making it hard to tell what the actual proportions are at each level. When I specify fewer tranches, it displays them in the same descending order no matter how many and in what order I specify them:

The total number of variants that get through the filter also seems to change depending on what tranches are specified (eg the 85 cutoff has ~50K in the second graph and ~30 in the third graph); I would have thought it would be the same for the same dataset, but I might be misunderstanding something.

1- Why does the number of variants change when the same tranche is requested as part of a different set of tranches? (And the Ti/Tv ratio seems to decrease regardless of whether the tranche is above or below 90.)

2- Why does the default tranche plot show the tranches in ascending order, but when I specify my own cutoffs it never does?

3- Alternatively, is there a way to remove the display of the cumulative TP/FPs to make the plot easier to read?

4- Or perhaps a simpler solution, can I bypass the plot altogether? Does one of the outputs from VariantRecal have the data (or a way of calculating it) about TP and FP predicted variants that is used to produce these plots so I can just compare the numbers?

Thanks very much, and let me know if I need to clarify anything
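
On question 4: the .tranches file that VariantRecalibrator writes is a plain comma-separated table (header lines start with #), and it holds the per-tranche counts and Ti/Tv values from which the plot's true/false-positive estimates are derived, so it can be read directly instead of the PDF. A quick way to eyeball it with standard Unix tools (the exact columns vary a little between GATK versions):

grep -v "^#" ${haplocall_date}_all_sites.snp.tranches | column -s, -t | less -S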


Created 2014-03-20 21:26:40 | Updated | Tags: variantrecalibrator
Comments (30)

I just updated to the latest nightly and got the same error:

INFO 12:03:16,652 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00258
INFO 12:03:23,474 ProgressMeter - GL000202.1:10465 5.68e+07 32.4 m 34.0 s 98.7% 32.9 m 25.0 s
INFO 12:03:32,263 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 12:03:41,008 VariantRecalibratorEngine - Evaluating full set of 4944219 variants...
INFO 12:03:41,100 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-03-20-g65934ae):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

Created 2014-03-19 20:13:36 | Updated | Tags: variantrecalibrator error
Comments (3)

Here is my error. I launch the same command on 3 VCF files generated by the same UnifiedGenotyper command.

thx!

`java -Djava.io.tmpdir=/scratch/ymb-542-aa/temp -Xmx22g -jar /rap/ymb-542-aa/utils/tools/dna_tools/post_alignment_tools/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -nt 8 -et NO_ET -K /rap/ymb-542-aa/utils/tools/dna_tools/post_alignment_tools/GenomeAnalysisTK-2.8-1/raphael.poujol_umontreal.ca.key -T VariantRecalibrator -R /scratch/ymb-542-aa/pancreas/Genomes/hg19.fa -input /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf -mode SNP -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=10.0 /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf -an QD -an DP -an FS -an MQ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0
-recalFile /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.snps.recal -tranchesFile /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.snps.tranches -rscriptFile /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.snps.plots.RINFO 20:41:18,337 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:41:18,338 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15 INFO 20:41:18,339 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 20:41:18,339 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 20:41:18,342 HelpFormatter - Program Args: -nt 8 -et NO_ET -K /rap/ymb-542-aa/utils/tools/dna_tools/post_alignment_tools/GenomeAnalysisTK-2.8-1/raphael.poujol_umontreal.ca.key -T VariantRecalibrator -R /scratch/ymb-542-aa/pancreas/Genomes/hg19.fa -input /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf -mode SNP -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=10.0 /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf -an QD -an DP -an FS -an MQ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.snps.recal -tranchesFile /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.snps.tranches -rscriptFile /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.snps.plots.R INFO 20:41:18,342 HelpFormatter - Date/Time: 2014/03/18 20:41:18 INFO 20:41:18,342 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:41:18,342 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:41:18,367 ArgumentTypeDescriptor - Dynamically determined type of /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf to be VCF INFO 20:41:18,395 ArgumentTypeDescriptor - Dynamically determined type of /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf to be VCF INFO 20:41:18,400 ArgumentTypeDescriptor - Dynamically determined type of /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf to be VCF INFO 20:41:18,404 ArgumentTypeDescriptor - Dynamically determined type of /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf to be VCF INFO 20:41:19,154 GenomeAnalysisEngine - Strictness is SILENT INFO 20:41:19,258 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 20:41:19,313 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,346 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:19,447 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:19,473 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:19,561 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine INFO 20:41:19,608 GenomeAnalysisEngine - Preparing for traversal INFO 20:41:19,622 GenomeAnalysisEngine - Done preparing for traversal
INFO 20:41:19,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 20:41:19,622 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining WARN 20:41:19,637 Utils - ******************************************************************************** WARN 20:41:19,637 Utils - * WARNING: WARN 20:41:19,637 Utils - * WARN 20:41:19,637 Utils - * Rscript not found in environment path. WARN 20:41:19,637 Utils - * /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.snps.plots.R will be WARN 20:41:19,637 Utils - * generated but PDF plots will not. WARN 20:41:19,637 Utils - ******************************************************************************** INFO 20:41:19,640 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0 INFO 20:41:19,640 TrainingSet - Found omni track: Known = false Training = true Truth = false Prior = Q10.0 INFO 20:41:19,641 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0 INFO 20:41:19,698 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,705 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,713 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:19,722 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,726 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:19,736 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,739 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:19,747 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,751 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:19,760 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,763 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:19,772 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Calling/sub1_focus.haplo.vcf INFO 20:41:19,775 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:19,845 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:19,856 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:19,932 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:19,947 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:21,428 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:21,437 RMDTrackBuilder - Loading Tribble index from disk for file 
/scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:21,492 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/hapmap_3.3.hg19.vcf INFO 20:41:21,503 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:21,514 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:21,526 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:21,538 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:21,593 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/1000G_omni2.5.hg19.vcf INFO 20:41:21,605 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:21,667 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:21,859 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/ymb-542-aa/pancreas/Genomes/dbsnp_138.hg19_noXY.vcf INFO 20:41:49,625 ProgressMeter - chr6:33023946 2.48e+07 30.0 s 1.0 s 34.9% 85.0 s 55.0 s INFO 20:42:19,626 ProgressMeter - chr14:38996406 5.30e+07 60.0 s 1.0 s 76.3% 78.0 s 18.0 s INFO 20:42:36,649 VariantDataManager - QD: mean = 16.83 standard deviation = 7.62 INFO 20:42:36,653 VariantDataManager - DP: mean = 19.65 standard deviation = 12.45 INFO 20:42:36,658 VariantDataManager - FS: mean = 0.41 standard deviation = 1.14 INFO 20:42:36,660 VariantDataManager - MQ: mean = 40.98 standard deviation = 2.71 INFO 20:42:36,718 VariantDataManager - Annotations are now ordered by their information content: [MQ, DP, QD, FS] INFO 20:42:36,720 VariantDataManager - Training with 5929 variants after standard deviation thresholding. INFO 20:42:36,729 GaussianMixtureModel - Initializing model with 100 k-means iterations... INFO 20:42:37,020 VariantRecalibratorEngine - Finished iteration 0. INFO 20:42:37,163 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.32548 INFO 20:42:37,262 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.40180 INFO 20:42:37,360 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.24424 INFO 20:42:37,459 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.30560 INFO 20:42:37,557 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.07436 INFO 20:42:37,655 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.02595 INFO 20:42:37,753 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.01395 INFO 20:42:37,851 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.01163 INFO 20:42:37,948 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.01117 INFO 20:42:38,046 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.02233 INFO 20:42:38,144 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.01832 INFO 20:42:38,242 VariantRecalibratorEngine - Finished iteration 60. 
Current change in mixture coefficients = 0.01221 INFO 20:42:38,340 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.01370 INFO 20:42:38,438 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.04717 INFO 20:42:38,536 VariantRecalibratorEngine - Finished iteration 75. Current change in mixture coefficients = 0.01022 INFO 20:42:38,634 VariantRecalibratorEngine - Finished iteration 80. Current change in mixture coefficients = 0.00365 INFO 20:42:38,731 VariantRecalibratorEngine - Finished iteration 85. Current change in mixture coefficients = 0.00875 INFO 20:42:38,829 VariantRecalibratorEngine - Finished iteration 90. Current change in mixture coefficients = 0.00391 INFO 20:42:38,927 VariantRecalibratorEngine - Finished iteration 95. Current change in mixture coefficients = 0.00411 INFO 20:42:39,005 VariantRecalibratorEngine - Convergence after 99 iterations! INFO 20:42:39,037 VariantRecalibratorEngine - Evaluating full set of 12403 variants... INFO 20:42:39,038 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to retrieve result
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Caused by: java.lang.NullPointerException
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139)
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
    ... 5 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unable to retrieve result
ERROR ------------------------------------------------------------------------------------------

`


Created 2014-03-10 22:03:01 | Updated | Tags: unifiedgenotyper variantrecalibrator applyrecalibration snp indels
Comments (1)

Hi,

I am running variant recalibration on indels; prior to this, I completed VariantRecalibrator and ApplyRecalibration on SNPs. So the input file is the recalibrated VCF file from the ApplyRecalibration step for SNPs.

Below is the error I am getting.

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator

The command I am using for this is:

jre1.7.0_40/bin/java -Djava.io.tmpdir=./rb_2905_VCF/tmp -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator -R dbdata/human_g1k_v37.fasta -input ${input_file} --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -recalFile $destdir/${input_file%recal.snps.vcf}recal.indel.recal -tranchesFile $destdir/${input_file%recal.snps.vcf}recal.indel.tranches -rscriptFile $destdir/${input_file%recal.snps.vcf}recal.indel.plots.R

If I remove the options -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum, then I get this error:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Argument with name '--use_annotation' (-an) is missing.
ERROR ------------------------------------------------------------------------------------------
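
A hedged note on the first error: it is specifically the DP values that are absent from every training variant, and the second error shows that at least one -an argument is mandatory, so the two attempts above bracket a middle option, namely keeping the other annotations and dropping only -an DP (or, as the message itself suggests, adding DP with VariantAnnotator first). Note also that the omni resource is a SNP call set, so it contributes little to an INDEL-mode model. A sketch of that middle ground, reusing the file-name pattern from the command above:

jre1.7.0_40/bin/java -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R dbdata/human_g1k_v37.fasta -input ${input_file} --maxGaussians 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf \
    -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL \
    -recalFile $destdir/${input_file%recal.snps.vcf}recal.indel.recal \
    -tranchesFile $destdir/${input_file%recal.snps.vcf}recal.indel.tranches \
    -rscriptFile $destdir/${input_file%recal.snps.vcf}recal.indel.plots.R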

Created 2014-02-24 16:12:14 | Updated | Tags: variantrecalibrator vqsr applyrecalibration indels
Comments (1)

Hi,

Given that there's no tranche plot generated for indels using VariantRecalibrator, how do we assess which tranche to pick for the next step, ApplyRecalibration? On SNP mode, I'm using tranche plots to evaluate the tradeoff between true and false positive rates at various tranche levels, but that's not possible with indels.

Thanks!

Grace
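
Even without the PDF, VariantRecalibrator in INDEL mode still writes the .tranches text file, and the chosen sensitivity is ultimately just the number handed to ApplyRecalibration, so one pragmatic approach is to read that comma-separated table directly and then apply a deliberately conservative level. A sketch, with placeholder file names and 99.0 as an illustrative choice rather than a recommendation:

java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R reference.fa -input raw_variants.vcf \
    -mode INDEL -recalFile indels.recal -tranchesFile indels.tranches \
    --ts_filter_level 99.0 -o recalibrated_indels.vcf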


Created 2014-02-23 15:23:48 | Updated | Tags: variantrecalibrator
Comments (1)

Hi there,

I have a problem running GATK 2.8's VariantRecalibrator: I get a code exception (java.lang.NullPointerException). Below is the error trace:

INFO 17:05:37,528 GenomeAnalysisEngine - Preparing for traversal
INFO 17:14:27,706 VariantDataManager - MQRankSum: mean = -0.09 standard deviation = 0.99
INFO 17:14:27,742 VariantDataManager - ReadPosRankSum: mean = 0.18 standard deviation = 0.96
INFO 17:14:27,820 VariantDataManager - Annotations are now ordered by their information content: [QD, ReadPosRankSum, MQRankSum]
INFO 17:14:27,823 VariantDataManager - Training with 2739 variants after standard deviation thresholding.
INFO 17:14:27,829 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 17:14:28,389 VariantRecalibratorEngine - Finished iteration 0.
INFO 17:14:28,534 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.20522
INFO 17:14:28,601 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.46290
INFO 17:14:28,675 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 9.95604
INFO 17:14:28,759 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 9.00456
INFO 17:14:28,861 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.01190
INFO 17:14:28,951 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.00796
INFO 17:14:29,041 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.00546
INFO 17:14:29,131 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00360
INFO 17:14:29,221 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00218
INFO 17:14:29,239 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 17:14:29,276 VariantRecalibratorEngine - Evaluating full set of 13495 variants...
INFO 17:14:29,280 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 17:14:32,024 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

And this is the command I used:

java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -Xmx4g -jar ~/genomekey-data/tools/gatk2.jar -T VariantRecalibrator -R ~/genomekey-data/bwa_references/human_g1k_v37.fasta -input ~/simplex/ngs_test/snp_calling/aln.haplotyper.raw.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 ~/genomekey-data/bundle/current/dbsnp_137.b37.vcf -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 ~/genomekey-data/bundle/current/hapmap_3.3.b37.vcf -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 ~/genomekey-data/bundle/current/1000G_omni2.5.b37.vcf -an QD -an MQRankSum -an ReadPosRankSum -recalFile ~/simplex/ngs_test/snp_recal/aln.snp.recal -tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90 -tranchesFile ~/simplex/ngs_test/snp_recal/aln.snp.tranches -rscriptFile ~/simplex/ngs_test/snp_recal/aln.snp.plots.R -mode SNP

Please note that I got the same error running GATK 2.6. Any help or suggestions will be appreciated.

Thanks, Shazly

Comments (4)

I read the documentation for MappingQualityRankSumTest and ReadPosRankSumTest: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_MappingQualityRankSumTest.html http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_ReadPosRankSumTest.html

Both pages read: "The ... rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles."

I have quite a few sites for which MQRankSum and ReadPosRankSum are missing. How does VariantRecalibrator handle this missing information?
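
As a quick sanity check on how widespread the gap is, the records lacking each annotation can be counted straight from the VCF; a rough sketch with standard Unix tools (it only greps the INFO field, so treat the numbers as approximate):

grep -v "^#" input.vcf | grep -vc "MQRankSum="        # records without MQRankSum
grep -v "^#" input.vcf | grep -vc "ReadPosRankSum="   # records without ReadPosRankSum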


Created 2014-02-18 09:57:38 | Updated | Tags: variantrecalibrator
Comments (3)

I am using GATK to process a dataset produced by customized target re-sequencing, but I failed at the VariantRecalibrator step.

The run log:

INFO 17:17:55,949 ArgumentTypeDescriptor - Dynamically determined type of ./hg19.liz.final.regions.bed to be BED
INFO 17:17:56,001 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:17:56,001 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15
INFO 17:17:56,001 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:17:56,001 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:17:56,007 HelpFormatter - Program Args: -T VariantRecalibrator -R ./hg19.fa -input ./output.raw.snps.indels.vcf -L./hg19.liz.final.regions.bed -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_138.hg19.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -recalFile output.snp.recal -tranchesFile output.snp.tranches -rscriptFile output.snp.plots.R
INFO 17:17:56,007 HelpFormatter - Date/Time: 2014/02/18 17:17:56
INFO 17:17:56,007 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:17:56,008 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:17:56,024 ArgumentTypeDescriptor - Dynamically determined type of ./output.raw.snps.indels.vcf to be VCF
INFO 17:17:56,030 ArgumentTypeDescriptor - Dynamically determined type of hapmap_3.3.hg19.vcf to be VCF
INFO 17:17:56,033 ArgumentTypeDescriptor - Dynamically determined type of dbsnp_138.hg19.vcf to be VCF
INFO 17:17:56,619 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:17:56,684 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:17:56,703 RMDTrackBuilder - Loading Tribble index from disk for file ./output.raw.snps.indels.vcf
INFO 17:17:56,728 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.hg19.vcf
INFO 17:17:56,764 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_138.hg19.vcf
INFO 17:17:56,935 IntervalUtils - Processing 801109 bp from intervals
INFO 17:17:57,012 GenomeAnalysisEngine - Preparing for traversal
INFO 17:17:57,017 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:17:57,017 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:17:57,017 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 17:17:57,024 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
INFO 17:17:57,024 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q6.0
INFO 17:17:59,530 VariantDataManager - QD: mean = 21.47 standard deviation = 9.62
INFO 17:17:59,530 VariantDataManager - MQRankSum: mean = -0.02 standard deviation = 1.36
INFO 17:17:59,531 VariantDataManager - ReadPosRankSum: mean = 0.50 standard deviation = 1.05
INFO 17:17:59,531 VariantDataManager - FS: mean = 2.59 standard deviation = 7.85
INFO 17:17:59,532 VariantDataManager - MQ: mean = 41.74 standard deviation = 1.10
INFO 17:17:59,538 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, ReadPosRankSum, MQRankSum]
INFO 17:17:59,539 VariantDataManager - Training with 402 variants after standard deviation thresholding.
WARN 17:17:59,539 VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.
INFO 17:17:59,542 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 17:17:59,658 VariantRecalibratorEngine - Finished iteration 0.
INFO 17:17:59,715 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.20178
INFO 17:17:59,726 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.05225
INFO 17:17:59,736 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.05773
INFO 17:17:59,746 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.03079
INFO 17:17:59,755 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.05754
INFO 17:17:59,765 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.09399
INFO 17:17:59,774 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.19735
INFO 17:17:59,784 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00727
INFO 17:17:59,793 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00804
INFO 17:17:59,803 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.00220
INFO 17:17:59,805 VariantRecalibratorEngine - Convergence after 51 iterations!
INFO 17:17:59,815 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Is GATK suitable for target re-sequencing data analysis (SNP calling and indel calling)?

Could you please help me out? Thanks!

zhoujj


Created 2014-02-12 20:31:19 | Updated | Tags: variantrecalibrator workflow multi-sample
Comments (1)

Hi there,

We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.

We seem to have too few indels to do VQSR on them and have a couple of questions:

  1. Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?

  2. If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?

  3. The other question is whether joint or batch variant calling across several samples would help us in this case?

Thanks in advance!
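
On question 1: VariantRecalibrator models whatever sites are in the single -input VCF, so giving it a multi-sample callset, either by calling all samples jointly or by merging per-sample VCFs first, is what increases the number of usable variants. A merge sketch using GATK's CombineVariants (sample file names are placeholders); joint calling across all BAMs in one HaplotypeCaller/UnifiedGenotyper run is the other route and avoids the separate merge:

java -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fa \
    --variant sample1.vcf --variant sample2.vcf --variant sample3.vcf \
    -o cohort.vcf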


Created 2014-02-11 18:50:32 | Updated | Tags: variantrecalibrator error
Comments (19)

Hi,

I'm now at the VQSR step of the best practices and to my surprise I got the following error related to java (I think):

Error: Could not find or load main class –jar

Here is my command line:

java –jar GenomeAnalysisTK.jar –T VariantRecalibrator –R ../human_g1k_v37.fasta –input ../raw_variants.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ../hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 ../1000G_omni2.5.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../dbsnp_138.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 ../1000G_phase1.snps.high_confidence.b37.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS --maxGaussians 4 –mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 –recalFile raw.SNPs.recal –tranchesFile raw.SNPs.tranches -rscriptFile recal.plots.R

I don't understand what the problem is. Could someone look at it and identify whether I made a mistake in my command line?

Thank you for your support
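
For what it's worth, the dashes in the pasted command are typographic en dashes (–) rather than plain ASCII hyphens (-), which is exactly why java reports "Could not find or load main class –jar": it never recognises –jar as the -jar option. Retyping the option prefixes with ordinary hyphens should get past this particular error, e.g.:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ../human_g1k_v37.fasta -input ../raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ../hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 ../1000G_omni2.5.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../dbsnp_138.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 ../1000G_phase1.snps.high_confidence.b37.vcf \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS --maxGaussians 4 -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile raw.SNPs.recal -tranchesFile raw.SNPs.tranches -rscriptFile recal.plots.R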


Created 2014-01-31 01:24:53 | Updated | Tags: variantrecalibrator
Comments (10)

Hello GATK team,

Thank you for the excellent toolkit and continuous development. I know this might be easy, but I want to suppress the legend in the tranches plot from the VariantRecalibration step. I looked for the R code that generated the plot in the GATK directory and couldn't find it. Is there possibly a way to remove that legend, or to place it on top of the plot instead of in the bottom left corner?

Thank you
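
The tranche plot is drawn by plot_Tranches.R, which ships inside the GATK jar rather than as a loose file (the logs elsewhere in this collection show it being run as "(resource)org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R"), which is probably why it did not turn up in the install directory. A sketch of pulling it out of the jar so the legend drawing can be located and adjusted, then re-running it on an existing .tranches file; the jar path, tranches file name and the 2.15 target Ti/Tv argument are taken from those logs and may differ on your system:

unzip -p GenomeAnalysisTK.jar \
    org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R > plot_Tranches.R
# adjust the legend drawing in plot_Tranches.R, then:
Rscript plot_Tranches.R my_callset.snp.tranches 2.15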


Created 2014-01-24 15:20:21 | Updated | Tags: variantrecalibrator plot
Comments (6)

I got this plot after VariantRecalibration for 42 samples in a VCF file. As can be seen in the plot, no "known" variants were detected. What is the problem? Which walker do you recommend to solve this issue? Thanks.
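
When the plot reports no "known" variants, the usual suspect is that no resource in the VariantRecalibrator command was tagged known=true, since that tag is what drives the known/novel split in the plots. A hedged sketch of the kind of resource line that supplies it (the dbSNP path is a placeholder; other commands in this collection use the same pattern):

-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf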


Created 2013-11-14 17:19:47 | Updated | Tags: variantrecalibrator vqsr
Comments (2)

I'm somewhat struggling with the new negative training model in 2.7. Specifically, this paragraph in the FAQ causes me trouble:

Finally, please be advised that while the default recommendation for --numBadVariants is 1000, this value is geared for smaller datasets. This is the number of the worst scoring variants to use when building the model of bad variants. If you have a dataset that's on the large side, you may need to increase this value considerably, especially for SNPs.

And so I keep thinking about how to scale it with my dataset, and I keep wanting to just make it a percentage of the total variants - which is of course the behavior that was removed! In the Version History for 2.7, you say

Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number

Can you comment a little further about how it scales? I'm assuming it's non-linear, and my intuition would be that smaller sets have proportionally more bad variants. Is that what you've seen? Do you have any other observations that could help guide selection of that parameter?


Created 2013-11-13 12:52:32 | Updated | Tags: variantrecalibrator queue
Comments (1)

Hi there, I do apologise in case this question has been answered already but I couldn't find an updated thread on it.

In my previous code I had a conditional based on the number of samples to modify the percentBad parameter in VariantRecalibrator. Now I get an error from my Scala code, as if the value were no longer a member of the class.

Checking again in the Best Practices and in the VariantRecalibrator documentation, I can't find it anymore. Could you confirm it's no longer used, or am I doing something wrong?

cheers, Francesco


Created 2013-10-31 00:42:42 | Updated | Tags: variantrecalibrator rscript crash
Comments (3)

I ran VariantRecalibrator on a smaller VCF file (made with UnifiedGenotyper -L chr1) and it finished with no errors. Then I ran it on a VCF file made with UnifiedGenotyper without the -L parameter, and it crashed, but also without any error. The log file of the first successful run looks like this:

...
INFO 13:26:21,377 VariantRecalibrator - Building FS x DP plot...
INFO 13:26:21,379 VariantRecalibratorEngine - Evaluating full set of 18354 variants...
INFO 13:26:23,556 VariantRecalibratorEngine - Evaluating full set of 18354 variants...
INFO 13:26:25,378 VariantRecalibrator - Building QD x DP plot...
INFO 13:26:25,379 VariantRecalibratorEngine - Evaluating full set of 6384 variants...
INFO 13:26:26,140 VariantRecalibratorEngine - Evaluating full set of 6384 variants...
INFO 13:26:26,832 VariantRecalibrator - Executing: Rscript /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/13_L_VariantRecalibrator/13_all_snp.plots.R
INFO 13:26:28,400 ProgressMeter - chrY:59358202 5.66e+07 14.0 m 14.0 s 98.7% 14.2 m 11.0 s
INFO 13:26:58,410 ProgressMeter - chrY:59358202 5.66e+07 14.5 m 15.0 s 98.7% 14.7 m 11.0 s
INFO 13:27:28,420 ProgressMeter - chrY:59358202 5.66e+07 15.0 m 15.0 s 98.7% 15.2 m 12.0 s
INFO 13:27:58,429 ProgressMeter - chrY:59358202 5.66e+07 15.5 m 16.0 s 98.7% 15.7 m 12.0 s
INFO 13:28:28,439 ProgressMeter - chrY:59358202 5.66e+07 16.0 m 16.0 s 98.7% 16.2 m 12.0 s
INFO 13:28:58,449 ProgressMeter - chrY:59358202 5.66e+07 16.5 m 17.0 s 98.7% 16.7 m 13.0 s
INFO 13:29:28,460 ProgressMeter - chrY:59358202 5.66e+07 17.0 m 18.0 s 98.7% 17.2 m 13.0 s
INFO 13:29:58,469 ProgressMeter - chrY:59358202 5.66e+07 17.5 m 18.0 s 98.7% 17.7 m 14.0 s
INFO 13:30:28,479 ProgressMeter - chrY:59358202 5.66e+07 18.0 m 19.0 s 98.7% 18.2 m 14.0 s
INFO 13:30:58,489 ProgressMeter - chrY:59358202 5.66e+07 18.5 m 19.0 s 98.7% 18.7 m 14.0 s
INFO 13:31:28,155 VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/13_L_VariantRecalibrator/13_all_snp.tranches 2.15
INFO 13:31:28,499 ProgressMeter - chrY:59358202 5.66e+07 19.0 m 20.0 s 98.7% 19.3 m 15.0 s
INFO 13:31:28,847 ProgressMeter - done 5.66e+07 19.0 m 20.0 s 98.7% 19.3 m 15.0 s
INFO 13:31:28,847 ProgressMeter - Total runtime 1140.77 secs, 19.01 min, 0.32 hours I ...

The log file of the crashed run ENDS like this:

...
INFO 19:09:28,987 VariantRecalibrator - Building FS x QD plot...
INFO 19:09:28,988 VariantRecalibratorEngine - Evaluating full set of 7300 variants...
INFO 19:09:30,191 VariantRecalibratorEngine - Evaluating full set of 7300 variants...
INFO 19:09:31,214 VariantRecalibrator - Building FS x DP plot...
INFO 19:09:31,217 VariantRecalibratorEngine - Evaluating full set of 21170 variants...
INFO 19:09:34,703 VariantRecalibratorEngine - Evaluating full set of 21170 variants...
INFO 19:09:37,357 VariantRecalibrator - Building QD x DP plot...
INFO 19:09:37,358 VariantRecalibratorEngine - Evaluating full set of 7250 variants...
INFO 19:09:38,552 VariantRecalibratorEngine - Evaluating full set of 7250 variants...
INFO 19:09:39,550 VariantRecalibrator - Executing: Rscript /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/13_VariantRecalibrator/13_all_snp.plots.R
INFO 19:09:51,101 ProgressMeter - chrY:59358159 5.68e+07 16.5 m 17.0 s 98.7% 16.7 m 13.0 s
INFO 19:10:21,111 ProgressMeter - chrY:59358159 5.68e+07 17.0 m 17.0 s 98.7% 17.2 m 13.0 s
INFO 19:10:51,119 ProgressMeter - chrY:59358159 5.68e+07 17.5 m 18.0 s 98.7% 17.7 m 14.0 s
INFO 19:11:21,129 ProgressMeter - chrY:59358159 5.68e+07 18.0 m 19.0 s 98.7% 18.2 m 14.0 s
INFO 19:11:51,139 ProgressMeter - chrY:59358159 5.68e+07 18.5 m 19.0 s 98.7% 18.7 m 14.0 s
INFO 19:12:21,148 ProgressMeter - chrY:59358159 5.68e+07 19.0 m 20.0 s 98.7% 19.3 m 15.0 s
INFO 19:12:51,157 ProgressMeter - chrY:59358159 5.68e+07 19.5 m 20.0 s 98.7% 19.8 m 15.0 s
INFO 19:13:21,167 ProgressMeter - chrY:59358159 5.68e+07 20.0 m 21.0 s 98.7% 20.3 m 16.0 s
INFO 19:13:51,176 ProgressMeter - chrY:59358159 5.68e+07 20.5 m 21.0 s 98.7% 20.8 m 16.0 s

I suspect that the execution of the tranches Rscript is what crashed it. Like I said, no error showed up when it crashed. Any suggestions on how to make it work?

My command line (Program Args as echoed by the GATK log at INFO 18:53:18,124 HelpFormatter):

-T VariantRecalibrator -R /cluster8/podlaha/HumanGenome/ucsc.hg19.fasta -mode SNP
-input /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/12_UnifiedGenotyper/12_UG_all.vcf
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /cluster11/podlaha/Software/GATK/Resources/hapmap_3.3.hg19.vcf
-resource:omni,known=false,training=true,truth=false,prior=12.0 /cluster11/podlaha/Software/GATK/Resources/1000G_omni2.5.hg19.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /cluster11/podlaha/Software/GATK/Resources/dbsnp_137.hg19.vcf
-resource:1000G,known=false,training=true,truth=false,prior=10.0 /cluster11/podlaha/Software/GATK/Resources/1000G_phase1.snps.high_confidence.hg19.vcf
-an MQ -an MQ0 -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an DP -an BaseQRankSum
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0
-numBad 1000 --maxGaussians 4
-recalFile /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/13_VariantRecalibrator/13_all_snp.recal
-tranchesFile /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/13_VariantRecalibrator/13_all_snp.tranches
-rscriptFile /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/13_VariantRecalibrator/13_all_snp.plots.R

Running GATK 2.7.4, R version 3.0.2 (2013-09-25) -- "Frisbee Sailing", and Java version 1.7.0_40.
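
(One way to narrow this down, as a rough sketch: run the generated R script by hand and see what R itself reports, and check that the plotting packages are available to the same Rscript that GATK invokes. Whether ggplot2 is among the packages the VQSR plot scripts load is my recollection rather than something stated above:

Rscript /cluster11/podlaha/AllelicImbalance/Data/GATK_VCF_Output/13_VariantRecalibrator/13_all_snp.plots.R
Rscript -e 'library(ggplot2)'    # should load silently if the package is installed

If either command fails, the message it prints is usually more informative than the silent crash in the GATK log.)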


Created 2013-10-11 14:17:51 | Updated | Tags: variantrecalibrator haplotypecaller reducereads
Comments (3)

What is the expected effect of running GATK 2.7 HaplotypeCaller + VQSR on a 30x WGS BAM versus the same BAM processed through ReducedReads beforehand? Would one expect exactly the same variants to be called on both files, or a small difference between them? Do HaplotypeCaller or VQSR treat the input differently depending on whether it comes from a full WGS BAM or a reduced WGS BAM?


Created 2013-10-02 00:56:39 | Updated | Tags: variantrecalibrator haplotypecaller queue mode
Comments (4)

Hi team, I have a basic question, sorry. I am trying to use HaplotypeCaller and VariantRecalibrator with Queue, but I don't know how to add:

VariantRecalibrator.mode = BOTH

HaplotypeCaller.genotyping_mode = DISCOVERY

I get a "not found: value BOTH/DISCOVERY" error.

Thank you in advance,
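
(In case it helps: in a Scala QScript those bare names are not in scope because the underlying arguments are Java enums, so they generally have to be referenced through their enclosing type and imported; for the recalibrator that would be something along the lines of VariantRecalibratorArgumentCollection.Mode.BOTH, and likewise the corresponding genotyping-mode enum constant for HaplotypeCaller rather than the bare word DISCOVERY. Treat the exact class names as an assumption to verify against the GATK/Queue sources for the version you are running.)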


Created 2013-09-17 09:58:33 | Updated 2013-09-17 10:38:55 | Tags: variantrecalibrator plots
Comments (1)

Hi,

I am running the INDEL model of VariantRecalibrator with GATK 1.6 (last license-compatible version in our hands), and the very last step seems to crash when writing the plot using plot_Tranches.R (see below).

All the previous steps seem to run well on 9 GB of RAM, and I presume this plotting step just produces an image for manual inspection and nothing else. Is it possible to switch off the plotting step?

Thanks

java -Xmx8g -jar /my/path/GATK/GenomeAnalysisTK.jar -et NO_ET -T VariantRecalibrator -R /my/path/WholeGenomeFASTA/genome.fa -input /my/path/genome.vcf.tmp2 -resource:mills,known=true,training=true,truth=true,prior=12.0 /my/path/Annotation/Mills_Devine_2hit.indels.hg19.vcf -an QD -an FS -an ReadPosRankSum -an DP --mode INDEL -recalFile /my/path/test/file.INDEL.recal -tranchesFile /my/path/test/file.INDEL.tranches

[...]
INFO 10:46:37,310 VariantRecalibrator - Writing out recalibration table...
INFO 10:46:46,202 VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R /my/path/test/file.INDEL.tranches 2.15

PS: I tried adding the option "-rscriptFile /dev/null", but it also seems to crash at the same step:

INFO 11:25:10,870 VariantRecalibrator - Executing: Rscript /dev/null
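
(A debugging sketch rather than an answer: according to the log above, the recalibration table is written before the Rscript step, so the .recal and .tranches outputs may already be usable even if the plot fails. To see the underlying R error, one option is to pull the bundled script out of the jar, assuming it really lives at the resource path shown in the log, and run it by hand:

unzip -p /my/path/GATK/GenomeAnalysisTK.jar org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R > plot_Tranches.R
Rscript plot_Tranches.R /my/path/test/file.INDEL.tranches 2.15

Whatever R prints there is likely the real cause of the crash.)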


Created 2013-09-02 17:08:33 | Updated 2013-09-04 21:09:56 | Tags: variantrecalibrator vcf filter ignorefilter remove
Comments (1)

Hello dear GATK Team,

it seems that the ignoreFilter argument in VariantRecalibrator does not work. I want to include variants carrying the LowQual filter in the calculation, but I can't find the right way to do it. I tried all of these:

-ignoreFilter LowQual
-ignoreFilter [LowQual]
-ignoreFilter "LowQual"
-ignoreFilter "Low Quality"
-ignoreFilter Low Quality
-ignoreFilter [Low Quality]

and also with --ignore_filter instead of -ignoreFilter.


Found 2 solutions:

1) Remove the LowQual filter with VariantFiltration and "--invalidatePreviousFilters"

2) "-ignoreFilter LowQual" has to be applied to ApplyRecalibration also...


Created 2013-08-13 04:31:16 | Updated | Tags: variantrecalibrator indels snps mode
Comments (3)

For exome sequencing, how should I decide on the "-mode" parameter? Should I use "SNP", "INDEL", or both? I want both kinds of variants in the final results; should I use -mode BOTH?
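
(Not an official answer, but my understanding of the recommended workflow is to recalibrate SNPs and indels in two separate passes rather than with -mode BOTH, applying the indel pass to the output of the SNP pass. A sketch with placeholder file names, and with resources, annotations and tranche settings elided:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input raw.vcf -mode SNP ... -recalFile snp.recal -tranchesFile snp.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -input raw.vcf -mode SNP -recalFile snp.recal -tranchesFile snp.tranches -o recal_snps.vcf
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input recal_snps.vcf -mode INDEL ... -recalFile indel.recal -tranchesFile indel.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -input recal_snps.vcf -mode INDEL -recalFile indel.recal -tranchesFile indel.tranches -o recal_snps_indels.vcf )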


Created 2013-08-05 13:55:07 | Updated | Tags: variantrecalibrator
Comments (3)

Hi,

I was running GATK version 2.1-13 and always got the following error when running VariantRecalibrator.

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37.fasta -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -nt 30 --maxGaussians 6 -mode SNP -log ISDB10654.snps.VarRecal.log -recalFile ISDB10654.snps.VarRecal.recal -tranchesFile ISDB10654.snps.VarRecal.tranches -rscriptFile ISDB10654.snps.VarRecal.plot.R -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_135.b37.vcf -input ISDB10654.snps.raw.vcf

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
ERROR ------------------------------------------------------------------------------------------

I ran it with the proposed suggestions, but it still gave the same error.

Because I was not running the latest version of GATK, I installed the latest version and tried it, and now I am getting this error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to retrieve result
        at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Caused by: java.lang.IllegalArgumentException: log10p: Values must be non-infinite and non-NAN
        at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:236)
        at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:224)
        at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:249)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.GaussianMixtureModel.evaluateDatumMarginalized(GaussianMixtureModel.java:272)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.GaussianMixtureModel.evaluateDatum(GaussianMixtureModel.java:230)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.evaluateDatum(VariantRecalibratorEngine.java:167)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.evaluateData(VariantRecalibratorEngine.java:100)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:330)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:132)
        at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
        at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
        ... 5 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.6-5-gba531bd):
ERROR
ERROR Please check the documentation guide to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unable to retrieve result
ERROR ------------------------------------------------------------------------------------------

I do not really have an idea of what causes this error. Does anyone have suggestions to help me?

Thanks,

Robin
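
(Not a definitive diagnosis, but both errors point at the Gaussian mixture model failing to fit this callset with the chosen annotations. The knobs named in the first error message are the usual starting point; note that their spelling changed between GATK versions, so it is worth checking the output of -T VariantRecalibrator --help for the release you are running. A sketch of the kind of adjustment meant, with everything else left as in the original command:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator ... --maxGaussians 4 ... )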


Created 2013-07-16 17:56:17 | Updated | Tags: variantrecalibrator indels
Comments (7)

I am trying to recalibrate the indel calls in my VCF file using the command line below:

java -Xmx2G -jar ../GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R ../GATK_ref/hg19.fasta \
    -input ./Variants/gcat_set_053_2.raw.snps.indels.vcf \
    -nt 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 ../GATK_ref/Mills_and_1000G_gold_standard.indels.hg19.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../GATK_ref/dbsnp_137.hg19.vcf \
    -an DP -an FS -an ReadPosRankSum -an MQRankSum \
    --maxGaussians 4 \
    -percentBad 0.05 \
    -minNumBad 1000 \
    -mode INDEL \
    -recalFile ./Variants/VQSR/gcat_set_053_2.indels.vcf.recal \
    -tranchesFile ./Variants/VQSR/gcat_set_053_2.indels.tranches \
    -rscriptFile ./Variants/VQSR/gcat_set_053_2.indels.recal.plots.R > ./Variants/VQSR/IndelRecal2-noAnnot.log

I got this error message even after following the recommendations (e.g. --maxGaussians 4, -percentBad 0.05). What does this error message mean? Do my files have too few variants? This is exome-seq data.

##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
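
(In case it helps others: this error generally means the Gaussian mixture model could not be fit with the number of variants available and the chosen annotations, which happens easily with exome-sized indel callsets. One adjustment I have seen suggested for exome data, offered here as a hint rather than official guidance, is to drop DP as an annotation, since capture depth is too variable to be informative, while keeping the relaxed model settings, i.e. the same command as above with "-an DP" removed:

... -an FS -an ReadPosRankSum -an MQRankSum --maxGaussians 4 -percentBad 0.05 -minNumBad 1000 -mode INDEL ... )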


Created 2013-07-11 08:11:05 | Updated | Tags: variantrecalibrator
Comments (1)

I have HiSeq exome data and am using GATK v2.5. While trying to do variant recalibration, I got an error with the defaults for -percentBad and --maxGaussians. Following the forum tip that suggests loosening these (increasing the first to 0.05 or decreasing the second to 4), the walker worked well. Either adjustment alone was enough for one of my datasets; for another dataset it only worked when both were adjusted, and yet another case is being tested with even more generous settings. The options I have adjusted are: -minNumBad, -percentBad, --maxGaussians.

What I wonder about these options is: 1. Could the appropriate values differ from sample to sample (i.e., is it normal to have to try and adjust them)? 2. Are there known values at the most generous level that still keep reasonable performance (to what extent is it safe to loosen them)?

Any comment much appreciated. Let me know if I missed some information. KIM


Created 2013-06-03 17:08:53 | Updated | Tags: variantrecalibrator vqsr plot gaussian misture model
Comments (8)

HI again!

Could you please help me to generate the first plot in the attached file which refers to VariantRecalibrator?

In other words, is this plot generated at the same time as my_sample.bqrecal.vqsr.R.scripts.pdf? If so, maybe some R library is missing, but I can't find anything wrong in the log files (my_sample.bqrecal.vqsr.R.scripts.pdf looks fine and healthy to me).

Thanks in advance, Rodrigo.


Created 2013-05-03 13:17:13 | Updated | Tags: variantrecalibrator
Comments (2)

Hello!

Thanks for developing this toolset and sharing it with such good support; it really contains a lot of wonderful tools.

I am just trying to understand VQSR in more detail. If I supply the resources recommended by the Best Practices (dbSNP, HapMap and 1KG Omni data) with default settings (which one is used for training, which one is the truth set, etc.): does the positive set contain all variants recorded in the resources marked training=true? And how does the tool select the negative set? Does it order the variants from high to low by QUAL and pick the bottom 5% (if percentBad = 0.05)? Will there be some overlap between the positive and negative sets? And is there any quality filtration on the data, e.g. dropping a data point that is more than a standard deviation away from the average?

Thanks


Created 2013-04-30 12:45:05 | Updated 2013-04-30 12:45:25 | Tags: variantrecalibrator combinevariants variantannotator annotations
Comments (4)

I have a set of VCFs with identical positions in them:

VCF1: 1 10097 . T . 26 . AN=196;DP=1622;MQ=20.06;MQ0=456 GT:DP

VCF2: 1 10097 . T . 21.34 . AN=198;DP=2338;MQ=19.53;MQ0=633 GT:DP

VCF3: 1 10097 . T . 11.70 . AN=240;DP=3957;MQ=19.74;MQ0=1085 GT:DP

VCF4: 1 10097 . T . 15.56 . AN=134;DP=1348;MQ=18.22;MQ0=442 GT:DP

If I use all of them as input for VariantRecalibrator, which annotations will VariantRecalibrator use? Should I instead merge the VCFs with CombineVariants and run VariantAnnotator, before I run VariantRecalibrator?

I'm not sure if the forum is for asking technical questions only or you are allowed to ask for best practices as well. Feel free to delete my question, if it doesn't belong here. Thank you.
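
(If the merge-first route is the one you end up taking, a rough sketch of it; the annotation choices and the BAM input are placeholders, not recommendations:

java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant VCF1.vcf --variant VCF2.vcf --variant VCF3.vcf --variant VCF4.vcf -o merged.vcf
java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R ref.fasta -V merged.vcf -I samples.bam -A QualByDepth -A FisherStrand -A RMSMappingQuality -o merged.annotated.vcf
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input merged.annotated.vcf ... )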


Created 2013-03-25 22:34:33 | Updated | Tags: variantrecalibrator mouse
Comments (9)

Hello, I am just trying VariantRecalibrator on my 4 samples:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R gatk.ucsc.mm10.fa -input UnifiedGenotyper.output.snps.raw.vcf -nt 14 -recalFile file_for_ApplyRecalibration.recal -tranchesFile file_for_ApplyRecalibration.tranches -resource:sanger,known=false,training=true,truth=true mgp.v2.snps.annot.reformat.vcf -resource:dbnsp,known=true,training=false,truth=false,prior=6.0 mm10_dbsnp.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mG 4 -percentBad 0.05

which starts running and then gives me this error:

INFO 08:26:57,741 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NumberFormatException: For input string: "."
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:481)
        at java.lang.Integer.valueOf(Integer.java:582)
        at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec.decodeInts(AbstractVCFCodec.java:680)
        at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:641)
        at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:92)
        at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:130)
        at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:120)
        at org.broadinstitute.sting.utils.variantcontext.GenotypesContext.iterator(GenotypesContext.java:461)
        at org.broadinstitute.sting.utils.variantcontext.VariantContext.getCalledChrCount(VariantContext.java:922)
        at org.broadinstitute.sting.utils.variantcontext.VariantContext.getCalledChrCount(VariantContext.java:908)
        at org.broadinstitute.sting.utils.variantcontext.VariantContext.isMonomorphicInSamples(VariantContext.java:937)
        at org.broadinstitute.sting.utils.variantcontext.VariantContext.isPolymorphicInSamples(VariantContext.java:948)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantDataManager.isValidVariant(VariantDataManager.java:278)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantDataManager.parseTrainingSets(VariantDataManager.java:263)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.map(VariantRecalibrator.java:259)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.map(VariantRecalibrator.java:107)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:243)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:231)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:120)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:67)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:23)
        at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:73)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: "."
ERROR ------------------------------------------------------------------------------------------

I've used all three VCFs in other GATK tools without issues. Any help greatly appreciated! Many thanks, Lavinia.
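
(Not an authoritative fix, but the stack trace shows the crash while decoding genotype fields of one of the resource VCFs, where a "." appears in a field expected to be an integer. Since, as far as I understand, VariantRecalibrator only uses site-level information from its resources, one workaround to try is a sites-only copy of the offending file, keeping the header and dropping the genotype columns:

awk 'BEGIN{FS=OFS="\t"} /^##/ {print; next} {print $1,$2,$3,$4,$5,$6,$7,$8}' mgp.v2.snps.annot.reformat.vcf > mgp.v2.snps.sites.vcf

and then pointing the -resource:sanger argument at the sites-only file.)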


Created 2013-02-20 03:05:44 | Updated 2013-02-20 03:07:11 | Tags: unifiedgenotyper variantrecalibrator intervals
Comments (1)

Hello, I want to know how important it is to use the -L target-intervals.intervals option in UnifiedGenotyper, and whether it is recommended for VariantRecalibrator too.

I have run the UnifiedGenotyper tool with one input file at a time, or with the two files I want to compare at once. My command lines are the following:

java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1-gatk.vcf  --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals

java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./2-gatk.vcf  --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-2.intervals

java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1vs2-gatk.vcf  --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals -L ./intervals-2.intervals

Thanks, G.


Created 2013-02-19 17:59:36 | Updated 2013-02-19 19:19:52 | Tags: variantrecalibrator bundle
Comments (5)

Hello, I am having a hard time finding the resource VCF files needed for VariantRecalibrator.

hapmap_3.3.b37.sites.vcf
1000G_omni2.5.b37.sites.vcf
dbsnp_135.b37.vcf

I can't seem to find them in the GATK bundle, as suggested here:

http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

Any help appreciated. Thanks, G.
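
(In case it is the access rather than the file names that is the obstacle, a sketch of listing the bundle over FTP; the server name, the gsapubftp-anonymous login with an empty password, and the directory layout are assumptions taken from memory of the linked article, so verify them there:

lftp -u gsapubftp-anonymous, ftp.broadinstitute.org -e "ls /bundle/; quit" )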


Created 2013-01-18 08:09:40 | Updated | Tags: unifiedgenotyper variantrecalibrator
Comments (1)

Hi,

I did some tests with a "best practices"-like pipeline to check whether the results are deterministic and found that they are not. Some posts already mention that UnifiedGenotyper is non-deterministic when using multi-threading, as different seeds are used for downsampling. But I think I'm missing something: if single-threaded UnifiedGenotyper is deterministic, why would it choose exactly the same reads for downsampling? Wouldn't it always be non-deterministic when downsampling reads? Anyway, the difference was only 31 variants for an exome sample.

As for VariantRecalibrator, I guess the filtering is non-deterministic, but I did not find any reference to this in the forum. The difference between runs is greater in this case: after filtering, 301 variants passed only in the first run, 1684 variants passed only in the second run, and 11328 variants passed in both runs.

Thanks in advance! Pablo.


Created 2012-12-05 23:15:31 | Updated 2012-12-06 15:56:14 | Tags: variantrecalibrator
Comments (5)

Hi, I'm encountering this error running VariantRecalibrator with data from 3 samples (I'm testing). Maybe the problem is due to the small sample size?

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.NullPointerException
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantDataManager.selectWorstVariants(VariantDataManager.java:179)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:306)
        at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:107)
        at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
        at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:97)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:94)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-16-g9f648cb):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Thanks
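
(Quite possibly, yes: with a callset this small the model may not find enough variants to build the negative training set, which is where the NullPointerException occurs in selectWorstVariants. For a small test set, a commonly suggested fallback is hard filtering instead of VQSR; a sketch with illustrative thresholds rather than tuned recommendations:

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta --variant raw.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" --filterName "hard_filter" -o filtered.vcf )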


Created 2012-11-22 00:41:42 | Updated 2012-11-22 01:25:17 | Tags: variantrecalibrator
Comments (3)

experiment: target enrichment and sequencing using Illumina platform

raw VCF file from UnifiedGenotyper -> VariantRecalibrator

I got the following error; is there any potential explanation? Thanks

##### **ERROR MESSAGE: Bad input: Error during negative model training. Minimum number of variants to use in training is larger than the whole call set. One can attempt to lower the --minNumBadVariants arugment but this is unsafe.**

I tried to lower this number, but a different error message came up.

script used:

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
-mode BOTH -nt 4 \
-R hg19_all_MT.fasta \
-input two.final.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.fy_left.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.site.fy.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp137_sort_fy_left.vcf \
-recalFile two.final.vcf.reca \
-tranchesFile two.final.vcf.tranches \
-rscriptFile two.final.vcf.R \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -tranche 85.0 -tranche 80.0 \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ -an FS -an HRun 

Created 2012-11-09 14:27:10 | Updated 2013-01-07 19:49:58 | Tags: variantrecalibrator filterliftedvariants
Comments (5)

Hello community, I am working with yeast and I am at the VariantRecalibrator step. As I don't have a truth data set, I want to "filter" my initial round of raw SNPs in order to keep the highest-quality SNPs, as the GATK team suggests.

1) I was wondering if you have any suggestions about the filtration parameters...

I am treating each strain as a different species (WGS), so I have good coverage (80X) but only one "lane". I tried:

java -Xmx4g -jar GenomeAnalysisTK.jar -R S288c.fasta -T VariantFiltration --variant $1.raw.vcf --filterExpression "QD<2.0 || MQ<45.0 || FS>60 || MQRankSum< -12.5 || ReadPosRankSum<-8.0 " --filterName "hardtovalidate" -o $1.filt.vcf

in order to then remove the LowQual and hardtovalidate SNPs. Does that make sense? Thanks for your help!

2) Then I would run VariantRecalibrator, but I will have only one truth set. Can I use -mode BOTH, or should I try to obtain a truth data set of indels and do the VQSR for SNPs and indels separately? What do you think?

java -Xmx4g -jar  GenomeAnalysisTK.jar -T VariantRecalibrator  -R ncbi_S288c.fasta -input $1.raw.vcf -recalFile $1.raw.recal -tranchesFile $1.raw.tranches -resource:filtered,known=false,training=true,truth=true,prior=15.0 $1.truth.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP **-mode both** 

Thanks! Luisa
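
(A sketch of the "separately" option raised in question 2, assuming a SNP truth set and an indel truth set can each be bootstrapped as in question 1; resource names, priors and annotation choices are placeholders:

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ncbi_S288c.fasta -input $1.raw.vcf -mode SNP -resource:bootstrap_snps,known=false,training=true,truth=true,prior=15.0 $1.snp.truth.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -recalFile $1.snp.recal -tranchesFile $1.snp.tranches
java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ncbi_S288c.fasta -input $1.raw.vcf -mode INDEL -resource:bootstrap_indels,known=false,training=true,truth=true,prior=15.0 $1.indel.truth.vcf -an QD -an ReadPosRankSum -an FS -recalFile $1.indel.recal -tranchesFile $1.indel.tranches

followed by ApplyRecalibration with the matching -mode for each.)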


Created 2012-10-27 07:32:18 | Updated 2012-10-29 21:37:57 | Tags: unifiedgenotyper variantrecalibrator haplotypecaller best-practices multi-sample
Comments (1)

Hello,

I've just made a long-needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before, but now that I have a great deal more exomes at my disposal I'd like to implement it fully and in a meaningful way.

The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?

Is there any difference or reason to choose one of the following two workflows over the other?

  1. Input 30+ exomes via the "-I" argument of either UnifiedGenotyper or HaplotypeCaller, perform the variant recalibration procedure on the resulting multi-sample VCF, and then split the individual call sets out of it with SelectVariants?

  2. Take 30+ individual VCF files, merge them together, perform variant recalibration on the merged VCF, and then split the individual call sets out of it with SelectVariants?

  3. Or some third option I'm missing

Any help is appreciated.

Thanks
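
(A sketch of workflow 1 with placeholder sample and file names; the VQSR step itself is unchanged. A .list file containing one BAM path per line can stand in for repeating -I thirty times:

java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref.fasta -I exome_bams.list -o joint.raw.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta -V joint.recalibrated.vcf -sn SAMPLE_01 -o SAMPLE_01.vcf

where joint.recalibrated.vcf is the output of VariantRecalibrator + ApplyRecalibration run on joint.raw.vcf.)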


Created 2012-10-11 18:12:51 | Updated 2013-01-07 19:13:46 | Tags: variantrecalibrator vqsr tranches
Comments (5)

Hello,

I am running Variant Quality Score Recalibration on indels with the following command.

java -Xmx8g -jar /raid/software/src/GenomeAnalysisTK-1.6-9-g47df7bb/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /raid/references-and-indexes/hg19/bwa/hg19_lite.fa \
    -input indel_output_all_chroms_combined.vcf \
    --maxGaussians 4 -std 10.0 -percentBad 0.12 \
    -resource:mills,known=true,training=true,truth=true,prior=12.0  /raid/Merlot/exome_pipeline_v1/ref/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -an QD -an FS -an HaplotypeScore -an ReadPosRankSum  \
    --ts_filter_level 95.0 \
     -mode INDEL \
    -recalFile /raid2/projects/STFD/indel_output_7.recal \
    -tranchesFile /raid2/projects/STFD/indel_output_7.tranches \
    -rscriptFile /raid2/projects/STFD/indel_output_7.plots.R

My tranches file reports only false positives for all tranches. When I run VQSR on SNPs, the tranches have many true positives and look similar to other tranche files reported on this site. I am wondering if anyone has had similar experiences or suggestions?

Thanks


Created 2012-09-27 09:20:35 | Updated 2012-10-02 16:33:21 | Tags: variantrecalibrator vqsr non-human dbsnp
Comments (9)

Hello, I have a newly sequenced genome with some samples for this species. I would like to follow the Best Practices, but I don't have a dbSNP or anything similar. Could I use the variants from the samples as a dbSNP? For example, take the variants that coincide across all my samples and use those as a dbSNP?

Thanks!