# Tagged with #vqsr 5 documentation articles | 1 announcement | 93 forum discussions

Created 2014-06-05 16:10:25 | Updated 2014-06-05 17:55:23 | Tags: vqsr bqsr phred quality-scores

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Now they are widely used to represent probabilities and confidence scores in other contexts of genome science.

### Phred scale in context

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

### Phred scale in practice

In today’s sequencing output, by convention, Phred-scaled base quality scores range from 2 to 63. However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A higher score indicates a higher probability that a particular decision is correct, while conversely, a lower score indicates a higher probability that the decision is incorrect.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$Q = -10 \log_{10} E$$

So we can interpret this score as an estimate of error, where the error is e.g. the probability that the base is called incorrectly by the sequencer, but we can also interpret it as an estimate of accuracy, where the accuracy is e.g. the probability that the base was identified correctly by the sequencer. Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$E = 10 ^{-\left(\frac{Q}{10}\right)}$$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

$$\begin{aligned} A &= 1 - E \\ &= 1 - 10 ^{-\left(\frac{Q}{10}\right)} \end{aligned}$$
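These conversions are easy to sketch in a few lines of Python (the function names here are illustrative, not part of any GATK tool):

```python
import math

def phred_to_error(q):
    """Probability that the underlying call is wrong, given Phred score Q."""
    return 10 ** (-q / 10.0)

def phred_to_accuracy(q):
    """Probability that the underlying call is right."""
    return 1.0 - phred_to_error(q)

def error_to_phred(e):
    """Phred score corresponding to an error probability E."""
    return -10.0 * math.log10(e)
```

For example, `phred_to_error(20)` gives 0.01 (a 1% chance of error at Q20), and `error_to_phred(0.001)` gives back Q30.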

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

| Phred Quality Score | Error | Accuracy (1 - Error) |
|---|---|---|
| 10 | 1/10 = 10% | 90% |
| 20 | 1/100 = 1% | 99% |
| 30 | 1/1000 = 0.1% | 99.9% |
| 40 | 1/10000 = 0.01% | 99.99% |
| 50 | 1/100000 = 0.001% | 99.999% |
| 60 | 1/1000000 = 0.0001% | 99.9999% |

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.

Created 2013-09-11 21:54:51 | Updated 2016-03-16 22:27:40 | Tags: vqsr callset filtering hard-filtering

### The problem:

Our preferred method for filtering variants after the calling step is to use VQSR, a.k.a. recalibration. However, it requires well-curated training/truth resources, which are typically not available for organisms other than humans, and it also requires a large number of variant sites to operate properly, so it is not suitable for small-scale experiments such as targeted gene panels or exome studies with fewer than 30 exomes. For the latter, it is sometimes possible to pad your cohort with exomes from another study (especially for humans -- use 1000 Genomes or ExAC!), but for non-human organisms this is often not possible.

### The solution: hard-filtering

So, if this is your case and you are sure that you cannot use VQSR, then you will need to use the VariantFiltration tool to hard-filter your variants. To do this, you will need to compose filter expressions using JEXL as explained here based on the generic filter recommendations detailed below. There is a tutorial that shows how to achieve this step by step. Be sure to also read the documentation explaining how to understand and improve upon the generic hard filtering recommendations.

### But first, some caveats

Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is very highly dependent on the specific callset, how it was called, what the data was like, what organism it belongs to, etc.

HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that **should at least provide a starting point for people to experiment with their data**.

In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.

In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both the number of samples and the size of the targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking of the filtering parameters.

### Filtering recommendations

Here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you. Be sure to read the documentation explaining how to understand and improve upon these recommendations.

Note that these JEXL expressions will tag as filtered any sites where the annotation value matches the expression. So if you use the expression QD < 2.0, any site with a QD lower than 2 will be tagged as failing that filter.

#### For SNPs:

• QD < 2.0
• MQ < 40.0
• FS > 60.0
• SOR > 4.0
• MQRankSum < -12.5
• ReadPosRankSum < -8.0

If your callset was generated with UnifiedGenotyper for legacy reasons, you can add HaplotypeScore > 13.0.
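To make the tagging behavior concrete, here is a small Python sketch (not a GATK tool; the filter names are illustrative) of what the SNP expressions above do to a site's annotations -- a site fails a filter whenever the corresponding expression matches:

```python
# Each entry mirrors one JEXL expression from the SNP recommendations above.
# Missing annotations simply don't match (mirroring JEXL's behavior when a
# field is absent), so sites aren't penalized for annotations they lack.
SNP_FILTERS = {
    "QD2":              lambda a: a.get("QD") is not None and a["QD"] < 2.0,
    "MQ40":             lambda a: a.get("MQ") is not None and a["MQ"] < 40.0,
    "FS60":             lambda a: a.get("FS") is not None and a["FS"] > 60.0,
    "SOR4":             lambda a: a.get("SOR") is not None and a["SOR"] > 4.0,
    "MQRankSum-12.5":   lambda a: a.get("MQRankSum") is not None and a["MQRankSum"] < -12.5,
    "ReadPosRankSum-8": lambda a: a.get("ReadPosRankSum") is not None and a["ReadPosRankSum"] < -8.0,
}

def failed_filters(annotations):
    """Return the names of filters a site fails; an empty list means PASS."""
    return [name for name, test in SNP_FILTERS.items() if test(annotations)]
```

For example, a site with `QD` of 1.5 but healthy values everywhere else fails only the QD filter.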

#### For indels:

• QD < 2.0
• ReadPosRankSum < -20.0
• InbreedingCoeff < -0.8
• FS > 200.0
• SOR > 10.0

### And now some more IMPORTANT caveats (don't skip this!)

• The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

• For shallow-coverage data (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.

• The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should follow a binomial sampling distribution, but in practice it is closer to a Gaussian. Regardless, the DP threshold should be set at 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
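As a rough illustration of that rule of thumb (this is a sketch, not a GATK utility; the input depths are invented):

```python
from statistics import mean, stdev

def max_dp_threshold(site_depths, n_sigma=6):
    # Place the DP > X cutoff n_sigma standard deviations above the mean
    # coverage, so that only sites with grossly excessive depth (likely
    # alignment artifacts) are filtered out.
    return mean(site_depths) + n_sigma * stdev(site_depths)
```

With depths averaging 30x and a standard deviation of 2, a 6-sigma cutoff lands at DP > 42.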

### Finally, a note of hope

Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.

HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.

And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?

Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)

We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.

Created 2013-06-17 22:26:13 | Updated 2014-12-17 17:04:18 | Tags: variantrecalibrator vqsr applyrecalibration

#### Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

• TBD

#### Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole-genome analysis designs.

#### Steps

1. Prepare recalibration parameters for SNPs
a. Specify which call sets the program should use as resources to build the recalibration model
b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
d. Determine additional model parameters

2. Build the SNP recalibration model

3. Apply the desired level of recalibration to the SNPs in the call set

4. Prepare recalibration parameters for Indels
a. Specify which call sets the program should use as resources to build the recalibration model
b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
d. Determine additional model parameters

5. Build the Indel recalibration model

6. Apply the desired level of recalibration to the Indels in the call set

### 1. Prepare recalibration parameters for SNPs

#### a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

• True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

• True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

• Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

• Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
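The percentages quoted above follow directly from the Phred scale; here is a quick Python check (`prior_prob` is an illustrative name, not a GATK function):

```python
def prior_prob(q):
    # Convert a Phred-scaled prior (the prior= value in a -resource tag)
    # into the probability that a site in that resource is a true variant.
    return 1.0 - 10 ** (-q / 10.0)
```

So `prior_prob(15)` is about 0.9684 (HapMap), `prior_prob(12)` about 0.9369 (Omni), `prior_prob(10)` is 0.9 (1000G), and `prior_prob(2)` about 0.369 (dbSNP and the default).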

#### b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

• Coverage (DP): Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

• QualByDepth (QD): Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

• FisherStrand (FS): Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

• StrandOddsRatio (SOR): Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

• MappingQualityRankSumTest (MQRankSum): The rank sum test for mapping qualities. Note that it cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• ReadPosRankSumTest (ReadPosRankSum): The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that it cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• RMSMappingQuality (MQ): Estimation of the overall mapping quality of reads supporting a variant call.

• InbreedingCoeff: Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

#### c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

• First tranche threshold 100.0

• Second tranche threshold 99.9

• Third tranche threshold 99.0

• Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
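The relationship between a sensitivity threshold and a VQSLOD cutoff can be sketched as follows (illustrative only; the real tool derives this from the model it fits over its training data):

```python
import math

def tranche_cutoff(truth_vqslods, target_sensitivity):
    # Sort the VQSLOD scores of truth-set variants best-first, then find the
    # score above which target_sensitivity of the truth sites are retained.
    scores = sorted(truth_vqslods, reverse=True)
    # Small epsilon guards against floating-point noise in the product.
    k = max(1, math.ceil(target_sensitivity * len(scores) - 1e-9))
    return scores[k - 1]
```

With 100 truth sites scored 1..100, a 90.0 tranche keeps everything with VQSLOD >= 11, while a 99.0 tranche relaxes the cutoff to VQSLOD >= 2 -- more sensitive, but admitting lower-scoring (and hence more doubtful) calls.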

### 2. Build the SNP recalibration model

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQ \
    -an MQRankSum \
    -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -rscriptFile recalibrate_SNP_plots.R
```

#### Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.

### 3. Apply the desired level of recalibration to the SNPs in the call set

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_raw_indels.vcf
```

#### Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the original variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.

### 4. Prepare recalibration parameters for Indels

#### a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

• Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

#### b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

• Coverage (DP): Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

• QualByDepth (QD): Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

• FisherStrand (FS): Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

• StrandOddsRatio (SOR): Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

• MappingQualityRankSumTest (MQRankSum): The rank sum test for mapping qualities. Note that it cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• ReadPosRankSumTest (ReadPosRankSum): The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that it cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• InbreedingCoeff: Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

#### c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

• First tranche threshold 100.0

• Second tranche threshold 99.9

• Third tranche threshold 99.0

• Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

#### d. Determine additional model parameters

• Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include very many variants.

### 5. Build the Indel recalibration model

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \
    -an QD \
    -an DP \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an InbreedingCoeff \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -rscriptFile recalibrate_INDEL_plots.R
```

#### Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.

### 6. Apply the desired level of recalibration to the Indels in the call set

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -o recalibrated_variants.vcf
```

#### Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the original variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set of Indels. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.

Created 2012-08-02 14:05:29 | Updated 2014-12-17 17:05:58 | Tags: variantrecalibrator bundle vqsr applyrecalibration faq

This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

• Explanation of resource datasets
• Important notes about annotations
• Important notes about exome experiments
• Argument recommendations for VariantRecalibrator
• Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).

### Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.
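The "bootstrap a truth set from your highest-confidence calls" idea above can be sketched like this (illustrative only; the QUAL cutoff of 100 is an arbitrary placeholder and the VCF parsing is minimal):

```python
def provisional_truth_set(vcf_lines, min_qual=100.0):
    # Keep only records whose QUAL (column 6 of a VCF body line) clears an
    # arbitrary high-confidence cutoff; return (chrom, pos) pairs that could
    # seed a provisional truth/training resource.
    truth = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[5] != "." and float(fields[5]) >= min_qual:
            truth.append((fields[0], int(fields[1])))
    return truth
```

In practice you would write the retained records back out as a VCF and pass it to VariantRecalibrator via a -resource tag with the prior you settle on.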

#### Resources for SNPs

• True sites training resource: HapMap
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

• True sites training resource: Omni
This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

• Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

• Known sites resource, not used in training: dbSNP
This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

#### Resources for Indels

• Known and true sites training resource: Mills
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

### Important notes about annotations

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

• Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

• You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by UnifiedGenotyper that should only be used if you called your variants with UG. This statistic isn't produced by the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes with HC.

• The InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that include many closely related samples (such as a family), please omit this annotation from the command line.

### Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

• Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.

• You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.

### Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

#### Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R path/to/reference/human_g1k_v37.fasta \
    -input raw.input.vcf \
    -recalFile path/to/output.recal \
    -tranchesFile path/to/output.tranches \
    -nt 4 \
    [SPECIFY TRUTH AND TRAINING SETS] \
    [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
    [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \


#### SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
    -mode SNP \


Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are missing, VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

#### Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

    --maxGaussians 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
    -mode INDEL \


Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.


### Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

#### Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.


    java -Xmx3g -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference/human_g1k_v37.fasta \
    -input raw.input.vcf \
    -tranchesFile path/to/input.tranches \
    -recalFile path/to/input.recal \
    -o path/to/output.recalibrated.filtered.vcf \
    [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
    [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \


#### SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

    --ts_filter_level 99.5 \
    -mode SNP \


#### Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

    --ts_filter_level 99.0 \
    -mode INDEL \


Created 2012-07-23 16:49:34 | Updated 2016-03-16 21:45:56 | Tags: variantrecalibrator vqsr applyrecalibration vcf callset variantrecalibration

This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. The first section is a high-level overview aimed at non-specialists. Additional technical details are provided below.

For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article. See the VariantRecalibrator tool doc and the ApplyRecalibration tool doc for a complete description of available command line arguments.

As a complement to this document, we encourage you to watch the workshop videos available in the Presentations section.

## High-level overview

VQSR stands for “variant quality score recalibration”, which is a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score that is supposedly super well calibrated (unlike the variant QUAL score which is a hot mess) called the VQSLOD (for variant quality score log-odds). I know this probably sounds like gibberish, stay with me. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.

The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.
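To make that limitation concrete, here is a toy sketch (the annotation names, thresholds, and variant values below are invented for illustration) of how independent hard thresholds discard a variant that is excellent in every dimension but marginal in one:

```python
# Toy hard filter: each annotation is thresholded independently.
# Thresholds and variant records are invented for illustration only.
def hard_filter(variant, min_thresholds):
    """Keep the variant only if EVERY annotation clears its own minimum."""
    return all(variant[ann] >= lo for ann, lo in min_thresholds.items())

thresholds = {"QD": 2.0, "MQ": 40.0}

# Outstanding quality-by-depth, mapping quality a hair below the cutoff:
strong_but_marginal = {"QD": 30.0, "MQ": 39.5}

print(hard_filter(strong_but_marginal, thresholds))  # rejected despite a strong overall profile
```

This is exactly the "one bad annotation throws out a good variant" failure mode the paragraph describes; VQSR avoids it by scoring the whole annotation profile at once.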

The VQSR method, in a nutshell, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”

Let’s do a quick mental visualization exercise (pending an actual figure to illustrate this), in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory within your subset.

How this is achieved is another can of worms. The key point is that we use known, highly validated variant resources (Omni, 1000 Genomes, HapMap) to select a subset of variants within our callset that we’re really confident are probably true positives (that’s the training set). We look at the annotation profiles of those variants (in our own data!), and from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes how likely it is to be real, based on all the examined dimensions. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question “what score do I need to choose so that e.g. 99% of the variants in my callset that are also in HapMap will be selected?”. This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.

## Technical overview

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator contrastively evaluates variants in a two step process, each performed by a distinct tool:

• VariantRecalibrator
Create a Gaussian mixture model by looking at the annotation values over a high quality subset of the input call set, and then evaluate all input variants. This step produces a recalibration file.

• ApplyRecalibration
Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

### How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.
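In spirit, the VQSLOD is the log odds of a variant's annotation profile under the "good" model versus the "bad" model, so variants near the heart of a good cluster score high and outliers score low. A deliberately simplified one-dimensional, single-Gaussian sketch (the real tool fits multivariate Gaussian mixtures; all parameters here are invented):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def toy_vqslod(x, good=(0.0, 1.0), bad=(4.0, 2.0)):
    # Log odds of the annotation value under the "good" vs "bad" density.
    # (mu, sigma) pairs for each model are made up for illustration.
    return math.log(gaussian_pdf(x, *good) / gaussian_pdf(x, *bad))

print(toy_vqslod(0.0))   # positive: near the heart of the "good" cluster
print(toy_vqslod(6.0))   # negative: an outlier relative to the good model
```

The same idea carries over to many annotation dimensions and many Gaussians per model: the score is always "how much more likely is this profile under the good model than the bad one".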

### How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program finds the VQSLOD value above which 99.9% of the variants in the training callset are included. It then uses that VQSLOD value as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.
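That threshold selection can be sketched as taking a quantile of the training-set VQSLOD scores. This is a simplified sketch with made-up scores, not the tool's actual implementation (which works with tranche boundaries and ties):

```python
def vqslod_cutoff(truth_scores, target_sensitivity):
    # Keeping the top `target_sensitivity` fraction of training variants
    # means cutting at the (1 - sensitivity) quantile of their VQSLODs.
    ranked = sorted(truth_scores)
    k = int(len(ranked) * (1.0 - target_sensitivity))
    return ranked[k]

# Toy VQSLODs for 1000 training-set variants
truth = [i / 10.0 for i in range(1, 1001)]
cutoff = vqslod_cutoff(truth, 0.99)
kept = [v for v in truth if v >= cutoff]
print(len(kept) / len(truth))  # 99% of training variants pass the cutoff
```

Your own callset is then filtered against the same cutoff, which is how the sensitivity-to-truth-sites dial translates into a concrete VQSLOD threshold.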

### Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

### Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (90), which has the lowest value of truth sensitivity but the highest value of novel Ti/Tv, is exceedingly specific but less sensitive. Each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can thus select more specific or more sensitive call sets in a principled way, or incorporate the recalibrated quality scores directly and weight individual variant calls by their probability of being real, avoiding the need to analyze only a fixed subset of calls. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.

### Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

• The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
• The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
• The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

### Finally, a couple of Frequently Asked Questions

#### - Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

#### - Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command-line version of R that gets installed alongside it. We also make use of the ggplot2 library, so please be sure to install that package as well. See the Common Problems section of the Guide for more details.

Created 2014-12-16 21:48:12 | Updated 2014-12-17 17:06:58 | Tags: vqsr best-practices

The Best Practices recommendations for Variant Quality Score Recalibration have been slightly updated to use the new(ish) StrandOddsRatio (SOR) annotation, which complements FisherStrand (FS) as an indicator of strand bias (only available in GATK version 3.3-0 and above).

While we were at it we also reconciled some inconsistencies between the tutorial and the FAQ document. As a reminder, if you ever find differences between parameters given in the VQSR docs, let us know, but FYI that the FAQ is the ultimate source of truth=true. Note also that the command line example given in VariantRecalibrator tool doc tends to be out of date because it can only be updated with the next release (due to a limitation of the tool doc generation system) and, well, we often forget to do it in time -- so it should never be used as a reference for Best Practice parameter values, as indicated in the caveat right underneath it which no one ever reads.

Speaking of caveats, there's no such thing as too much repetition of the fact that whole genomes and exomes have subtle differences that require some tweaks to your command lines. In the case of VQSR, that means dropping Coverage (DP) from your VQSR command lines if you're working with exomes.

Finally, keep in mind that the values we recommend for tranches are really just examples; if there's one setting you should freely experiment with, that's the one. You can specify as many tranche cuts as you want to get really fine resolution.

Created 2016-05-20 09:35:24 | Updated | Tags: variantrecalibrator vqsr resource training

Dear GATK team,

I am working with maize aDNA and would like to find SNPs called in aDNA samples that are at least as good as those in HapMap.

Am I right to assume that variant recalibration is the correct tool for the job? My problem is that I can't find details about the format of the "resources" used for training. (I would like to see your hapmap.vcf but I can't log in to your FTP; it tells me that there can be only 25 users there.)

From your documentation (https://www.broadinstitute.org/gatk/guide/article?id=2805 ; https://www.broadinstitute.org/gatk/guide/article?id=1259 and https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php#--resource) I gather that the training set needs to be in vcf file.

Here is my question: Which components are important in resource vcf file; positions? variants? annotations? are SNP calls for each individual important? In other words, is the purpose of training "resource" to 1) find overlapping sites in my vcf and use annotations that are in my vcf or 2) is it the training resource annotations that are being used? If the former is true, would it be possible to use bed files instead? If the latter is true, do I need individual calls? HapMap for my organism has > 1000 individuals and is very heavy; if possible I would like to avoid downloading it all.

Many thanks for your help,

Best wishes, Rafal

Created 2016-04-08 06:46:30 | Updated | Tags: vqsr plots

Hi,

I ran VQSR with 150+ whole genome samples. Attached is one of the Gaussian mixture model plots. I have read the VQSR guide on how to interpret the plots but I don't quite understand them. Referring to the plot, the green area represents good quality variants while variants that fall in the red area should be filtered out; the annotations MQ and QD are plotted against each other. For MQ, I can see that good quality variants have MQ around 60. What about QD? What can I do with these plots?

JF

Created 2016-03-28 18:24:17 | Updated | Tags: vqsr tranches

I'm doing a large variant calling project on a cohort of ~10,000 exomes. I've run into an issue with VQSR. Everything appears to be working normally except for my output tranche plot (attached), where I'm seeing no false positives. I know this is too good to be true.

From reading other posts on here, I used dbsnp_138.b37.excluding_sites_after_129.vcf, but this didn't change my plots. Some other details: my exomes were generated with different kits. While running VQSR I have tried using -L with the superset of all capture regions (probably not the best idea) and the intersection of all capture regions (what I intend to use), but the tranche plots look the same regardless.

Using GATK v3.5. Any suggestions would be greatly appreciated.

Alex

Created 2016-02-12 15:46:30 | Updated 2016-02-12 15:47:31 | Tags: vqsr

Hi, I am trying to run VQSR and an error occurred.

Here are my commands:

    java -Xmx240g -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /ref/ucsc.hg19.fasta \
    -input input_raw.vcf \
    -recalFile snp.recal.vcf \
    -tranchesFile snp.tranches \
    -rscriptFile recalibrate_SNP_plots.R \
    -nt 8 \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /ref/hapmap_3.3.hg19.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /ref/1000G_omni2.5.hg19.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /ref/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /ref/dbsnp_138.hg19.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \

and error messages.

##### ERROR MESSAGE: Bad input: Values for MQ annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.

However, MQ is annotated in my input files. The sample size is large enough (around 150 individuals) for VQSR and a combined vcf file of these samples has around 280,000 variants (it's exome data). The genome build version matches between my input data and reference, and the GATK version is 3.5.

I know similar errors have been reported before, but couldn't find the right solution.

Created 2016-01-29 20:08:17 | Updated | Tags: vqsr vqsr-exome

Hi, I read on the best practices slides that I should not use VQSR if the cohort is small. I have only one sample, for a single individual. I was wondering how useful it is to perform VQSR on this sample by using the GVCFs from 1000G? Am I losing a lot by not doing the VQSR step for a single sample? I have some parameters in mind that I want the final VCF to satisfy: min Q and min depth.

Thanks

Created 2016-01-28 15:08:47 | Updated | Tags: indelrealigner variantrecalibrator vqsr haplotypecaller best-practices

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this: 1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero? 2) Is the use of MQCapForLogitJitterTransform recommended? 3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.

Created 2016-01-28 14:42:48 | Updated | Tags: vqsr

Hi, I have run VQSR on my SNP data and got a negative number of novel variants. I attached the tranches file. I am working on whole exome data and here are my commands.

```
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R IsaacIndex/genome.fa \
    -input data_raw.vcf \
    -recalFile data_snp.recal.vcf \
    -tranchesFile vqsr_snp.tranches \
    -nt 8 \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
    -mode SNP
```

```
java -Xmx3g -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R IsaacIndex/genome.fa \
    -input data_raw.vcf \
    -tranchesFile vqsr_snp.tranches \
    -recalFile data_snp.recal.vcf \
    -o data_snp.vqsr.vcf \
    --ts_filter_level 99.5 \
    -mode SNP
```

```
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R IsaacIndex/genome.fa \
    -input data_raw.vcf \
    -recalFile data_indel.recal.vcf \
    -tranchesFile vqsr_indel.tranches \
    -nt 8 \
    --maxGaussians 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
    -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
    -mode INDEL
```

```
java -Xmx3g -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R IsaacIndex/genome.fa \
    -input data_raw.vcf \
    -tranchesFile vqsr_indel.tranches \
    -recalFile data_indel.recal.vcf \
    -o data_indel.vqsr.vcf \
    --ts_filter_level 99.0 \
    -mode INDEL
```

Created 2016-01-25 00:47:12 | Updated | Tags: vqsr resource variant-recalibration

Hi, I am following through the best practice pipeline and I am confused of which files I should use as the resource to run VQSR.

I have downloaded 4 files and I just want to make sure which file is which database.

- 1000G_omni2.5.b37.vcf.gz = omni.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf.gz = 1000G.vcf
- dbsnp_138.b37.vcf.gz = snp.vcf
- hapmap_3.3.b37.vcf.gz = hapmap.vcf

I am planning to unzip them and run VQSR. Did I download the correct files? If not, could you tell me which files I need to use? Thank you.

Shane
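For reference, the four bundle files listed above map onto the `-resource` arguments of the Best Practices SNP recalibration command; the priors below match the commands quoted elsewhere in this thread. A small Python sketch that renders them (`resource_args` is a hypothetical helper name):

```python
# Map GATK bundle files (b37) to the VariantRecalibrator resource tags used in
# the Best Practices SNP recalibration commands quoted in this thread.
BUNDLE_TO_RESOURCE = {
    "hapmap_3.3.b37.vcf.gz": "hapmap,known=false,training=true,truth=true,prior=15.0",
    "1000G_omni2.5.b37.vcf.gz": "omni,known=false,training=true,truth=true,prior=12.0",
    "1000G_phase1.snps.high_confidence.b37.vcf.gz": "1000G,known=false,training=true,truth=false,prior=10.0",
    "dbsnp_138.b37.vcf.gz": "dbsnp,known=true,training=false,truth=false,prior=2.0",
}

def resource_args(mapping):
    """Render -resource arguments for a VariantRecalibrator command line."""
    return [f"-resource:{spec} {path}" for path, spec in mapping.items()]
```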

Created 2016-01-17 16:39:33 | Updated | Tags: vqsr runtime-error stack-error

Hello everyone!

While trying to build the SNP recalibration model using GATK, I encountered the following runtime error. Please help in resolving this issue.

##### ERROR ------------------------------------------------------------------------------------------

Created 2016-01-17 12:15:32 | Updated 2016-01-17 12:16:21 | Tags: vqsr snps calibration-model

Hello everyone,

Kindly help me in fixing the following error while trying to build the SNP recalibration model:

```
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Reference/hg19.fa -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /Reference/hapmap_3.3.hg19.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /Reference/1000G_omni2.5.hg19.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /Reference/1000G_phase1.snps.high_confidence.hg19.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /Reference/dbsnp_138.hg19.vcf \
    -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches \
    -rscriptFile recalibrate_SNP_plots.R
```

```
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Invalid argument value '-resource:hapmap known=false training=true truth=true prior=15.0' at position 6.
##### ERROR Invalid argument value '/Reference/hapmap_3.3.hg19.vcf' at position 7.
##### ERROR Invalid argument value '-resource:omni known=false training=true truth=true prior=12.0' at position 8.
##### ERROR Invalid argument value '/Reference/1000G_omni2.5.hg19.vcf' at position 9.
##### ERROR Invalid argument value '-resource:1000G known=false training=true truth=false prior=10.0' at position 10.
##### ERROR Invalid argument value '/Reference/1000G_phase1.snps.high_confidence.hg19.vcf' at position 11.
##### ERROR Invalid argument value '-resource:dbsnp known=true training=false truth=false prior=2.0' at position 12.
##### ERROR Invalid argument value '/Reference/dbsnp_138.hg19.vcf' at position 13.
##### ERROR ------------------------------------------------------------------------------------------
```

Thanks and Best Regards, Adnan

Created 2016-01-11 15:50:06 | Updated | Tags: vqsr haplotypecaller bug error

We are having problems running GATK VQSR on recent VCF files generated using GATK HaplotypeCaller in the N+1 mode. When running VariantRecalibrator we get the following error (with both 3.4 and 3.5):

```
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
…
##### ERROR MESSAGE: The allele with index 5 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------
```

The command:

```
JAVA=~/tools/jre1.8.0_66/bin/java
REF=~/refs/bosTau6.fasta
GATK=~/tools/GenomeAnalysisTK.jar

$JAVA -jar $GATK -T VariantRecalibrator -R $REF -input GATK-HC-3.4-DAMONA_10_Jan_2016.vcf.gz \
    -resource:ill770k,known=false,training=true,truth=true,prior=15.0 /home/aeonsim/refs/LIC-seqdHDs.AC1plus.DP10x.vcf.gz \
    -resource:VQSRMend,known=false,training=true,truth=false,prior=10.0 /home/projects/bos_taurus/damona/vcfs/recombination/GATK-HC-95.0tr-GQ40-10Dec-2015run.mend.Con.vcf.gz \
    -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 \
    -recalFile GATK-HC-DAMONA-Jan-2016.recal -tranchesFile GATK-HC-DAMONA-Jan-2016.tranches \
    -rscriptFile GATK-HC-DAMONA-Jan-2016.R
```

The VCF files were created using GenotypeGVCFs (v3.4-46-gbc02625) on GVCF files created with GATK HaplotypeCaller (N+1 mode, same version), which had been combined using GATK CombineGVCFs to merge ~750 whole bovine genome GVCFs in batches of 20-90 individuals. All stages appeared to finish successfully without error until the VQSR stage. All processing of the GATK files was done using GATK tools (v3.4-46), and running the ValidateVariants tool on the files only gives warnings about "Reference allele is too long (116)", nothing about alleles not being defined in the REF/ALT column.

```
java -jar GenomeAnalysisTK.jar -T ValidateVariants -R bosTau6.fasta -V GATK.chr29.vcf.gz \
    --dbsnp BosTau6_dbSNP140_NCBI.vcf.gz -L chr29

ValidateVariants - Reference allele is too long (116) at position chr29:2513143; skipping that record. Set --referenceWindowStop >= 116
…
Successfully validated the input file.  Checked 531640 records with no failures.
```

Older versions of GATK worked successfully on part of this dataset last year. Now that we've completed the dataset and rerun everything with the latest version of GATK (at the time 3.4-46) using only GATK tools, GATK VQSR is unable to process the files that were created by its own HaplotypeCaller (from the same version), while the validation tools supplied with GATK claim there is no problem, and programs like bcftools happily process the files.

It appears to me that the problem may be related to the introduction of the * allele in the ALT column in VCF 4.2, and a failure of VQSR to fully support this?
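If the spanning-deletion allele is the suspect, a quick diagnostic is to list the records whose ALT column contains `*` before handing the file to VQSR. A minimal Python sketch (a diagnostic aid, not a fix; `records_with_spanning_deletion` is a hypothetical helper):

```python
def records_with_spanning_deletion(vcf_lines):
    """Yield (CHROM, POS) for records whose ALT column contains the '*'
    allele introduced in VCF 4.2 for spanning deletions.

    A quick way to locate the sites a VQSR run might be tripping over;
    expects decompressed, tab-delimited VCF text lines.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # ALT is column 5; multiple alleles are comma-separated.
        if "*" in fields[4].split(","):
            yield fields[0], int(fields[1])
```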

Created 2015-12-17 19:55:26 | Updated | Tags: vqsr

Hi,

I am trying to build a variant truth set for Macaque based on validated dbsnp entries to train VQSR with. This has required a lot of formatting and I'm wondering if I can get some advice.

I have written a script that writes entries from dbsnp into a VCF file. My file currently looks like this:

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ferguson
```

When I try to run VQSR using this vcf as a truth set, I get thrown an error. I am using a nightly build of GATK because of a previous issue with htsjdk I ran into in the past, the error looks similar to this past issue (http://gatkforums.broadinstitute.org/discussion/6128/gatk-selectvariants-runtime-error#latest)

Command:

```
java -Xmx8g -jar ~/GenomeAnalysisTK-nightly-2015-12-17-gec72eac/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R ../../../174\ WGS\:WES/0.174k\ macaque\ genome\ project/ref/mafa5/mafa5.fa \
    -input ../../../174\ WGS\:WES/0.174k\ macaque\ genome\ project/01-unfiltered\ variants/16557.all.exons.mafa5.ann.vcf \
    -resource:ferguson,known=false,training=true,truth=true,prior=15.0 16797.ferguson.mafa5.mcm.validatedvcf \
    -an QD -an MQ -an MQRankSum \
    -mode SNP \
    -recalFile test.recal \
    -tranchesFile test.tranches \
    -rscriptFile test.plots.R
```

Error:

##### ERROR stack trace

```
java.lang.NullPointerException
    at htsjdk.variant.vcf.VCFCompoundHeaderLine.<init>(VCFCompoundHeaderLine.java:168)
    at htsjdk.variant.vcf.VCFInfoHeaderLine.<init>(VCFInfoHeaderLine.java:48)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseHeaderFromLines(AbstractVCFCodec.java:199)
    at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:111)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:88)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:41)
    at htsjdk.tribble.index.IndexFactory$FeatureIterator.readHeader(IndexFactory.java:413)
    at htsjdk.tribble.index.IndexFactory$FeatureIterator.<init>(IndexFactory.java:401)
    at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:312)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:441)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:327)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:264)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:153)
    at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.<init>(ReferenceOrderedDataSource.java:208)
    at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.<init>(ReferenceOrderedDataSource.java:88)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1047)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:828)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:286)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
```

##### ERROR MESSAGE: Code exception (see stack trace for error itself)

Do you know why this is happening? When I run VQSR against another truth set that was called using HaplotypeCaller and GenotypeGVCFs, I don't run into this problem, which is why I believe it has something to do with the format of my dbsnp VCF. I'm hoping to understand how to format VCFs properly to run with VQSR, in order to build this set out with more validated entries from dbsnp. Any advice would be greatly appreciated.

Thank you, Trent
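Since htsjdk died while parsing the header, one plausible culprit for a hand-written VCF is an INFO key that is used in the records but never declared, or declared with a malformed `##INFO` line. Below is a rough Python consistency check under that assumption; `undeclared_info_keys` is a hypothetical helper, not a full VCF validator:

```python
import re

def undeclared_info_keys(vcf_lines):
    """Return INFO keys used in records but never declared via an ##INFO
    header line -- a common problem in hand-written VCFs that can trip up
    strict parsers like htsjdk. (A rough consistency check only.)"""
    declared, used = set(), set()
    for line in vcf_lines:
        if line.startswith("##INFO=<"):
            m = re.search(r"ID=([^,>]+)", line)
            if m:
                declared.add(m.group(1))
        elif not line.startswith("#"):
            info = line.rstrip("\n").split("\t")[7]
            for kv in info.split(";"):
                if kv and kv != ".":
                    used.add(kv.split("=", 1)[0])
    return used - declared
```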

Created 2015-12-04 06:45:03 | Updated | Tags: vqsr gatk wgs

Hi. I ran GATK (version 3.4-46) on whole genome sequencing data from 60 samples. Using bwa, I aligned my WGS data to the human reference genome (GRCh37) including the autosomes, X, Y, MT, and the GL* contigs. I ran HaplotypeCaller, and then using GenotypeGVCFs extracted a VCF for chr1-chr22, chrX, and chrY. I wonder what the GATK Best Practices mean by WGS: when GATK (VQSR, HC, et al.) is used, to get the best result should the WGS data consist of the autosomes, X, Y, MT and the GL* contigs, or could it contain only the autosomes, excluding X and Y?

Created 2015-12-02 21:32:30 | Updated | Tags: variantrecalibrator vqsr tranches

I ran GATK's cohort genotyping pipeline on 5000 human samples with Illumina WGS ~1.3x data, up through GenotypeGVCFs (and CatVariants to combine chunks) using v3.4-46. Next I ran VariantRecalibrator (initially just chr1) using recommended settings with both v3.4-46 and v3.5. Here is my command for both versions:

```
java -Xmx40g \
    -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R hs37m.fa \
    -input gatk.hc.combined.genotyped.chr1-22.vcf.gz \
    -recalFile snps.recal \
    -tranchesFile snps.tranches \
    -rscriptFile recalibrate_SNP_plots.R \
    --target_titv 2.15 \
    -nt 24 \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf.gz \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf.gz \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf.gz \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
    -mode SNP \
    -L 1 \
    -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 98.5 -tranche 90.0 \
    --maxGaussians 6 \
    -log VariantRecalibrator.snps.log
```

The attached tranches plots (snps.tranches.v3.5.pdf) generated w/ v3.5 look strange because:

1) The tranches are out of order on the bar plot (e.g., 99.5 is before 99)
2) The fill coloring doesn't make sense for tranches 99 and 98.5 - there are orange stripes over the blue bar
3) The scatter plot's connecting lines go in both directions

The plots for v3.4-46 look more normal (snps.tranches.v3.4-46.pdf), though I'm still trying to figure out how to get closer to the expected 2.15 Ti/Tv ratio. Oddly, the Ti/Tv ratios differ slightly between v3.4-46 and v3.5 even though the same data and settings were used.
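For reference, the Ti/Tv ratio discussed above can be computed directly from the called SNPs. A minimal Python sketch over (REF, ALT) pairs of biallelic SNPs:

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# everything else is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snps):
    """Compute the transition/transversion ratio from (REF, ALT) pairs of
    biallelic SNPs. Genome-wide human data is typically ~2.0-2.1; exome
    target regions run higher, which is why the command above sets
    --target_titv 2.15."""
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")
```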

I suspected the behavior w/ v3.5 may be a possible bug in VariantRecalibrator, which is why I'm posting here. Please let me know if you need any more information.

My best,

Chris

Created 2015-12-02 16:54:57 | Updated | Tags: vqsr filtering

I am working with macaque WGS and WES data and I'm trying to implement VQSR effectively. In the NHP genomics community there aren't large databases of validated true variants, so we're attempting to create our own sets using high-quality amplicon and genotyping data that exists within our own lab. My question is: how many variants does VQSR expect in order to create an effective Gaussian mixture model? Do you have any advice as to how to develop a reasonably sized truth set in the absence of databases like dbsnp, hapmap, etc.?

Thank you, Trent

Created 2015-12-01 19:15:09 | Updated | Tags: variantrecalibrator vqsr annotations

Is it possible to run VQSR with custom annotations? I added some additional columns (not produced by GATK) to the INFO field of my VCF file, and included them with the "-an" flag to VariantRecalibrator. The VariantRecalibratorEngine reports convergence, but then I get this:

INFO  12:55:25,499 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
##### ERROR stack trace
java.lang.IllegalArgumentException: No data found.

##### ERROR MESSAGE: No data found.

I used to get the "No data found" error when performing VQSR with -mode INDEL, and I know it was due to the fact that there were not enough indel variants overlapping with the training data. But this time the mode is SNP. I counted the total number of SNPs in my dataset (awk 'length($4)==1 && length($5)==1' /path/to/VQSR_0722/All8sample.vcf | wc -l) and there are 184,540 SNPs. I wonder if the cause is the same as for the INDEL VQSR error?

Thank you!

Emma
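The awk one-liner above has a direct Python equivalent. Note that, like the awk version, it silently skips multiallelic sites, because their comma-separated ALT string is longer than one character:

```python
def count_biallelic_snps(vcf_lines):
    """Count records where both REF and ALT are a single base, mirroring
    awk 'length($4)==1 && length($5)==1'. Multiallelic sites (ALT like
    "T,A") are excluded by both versions."""
    n = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields[3]) == 1 and len(fields[4]) == 1:
            n += 1
    return n
```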

Created 2015-07-21 15:22:40 | Updated 2015-07-21 15:24:15 | Tags: vqsr vqslod

We are doing WES on hundreds of samples and the average sequencing depth is 60X. I used a 99.9% truth sensitivity threshold for the PASS filter in VQSR. Attached is the histogram of VQSLOD for around 900,000 SNPs with a PASS filter and call rate > 95%. I wonder if you can help me with two questions?

1. Is it normal to have half of the SNPs with VQSLOD < 0?
2. Why are there so few SNPs around VQSLOD = 0?
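On question 1: VQSLOD = 0 corresponds to even odds of being true vs. false under the model, so many PASS sites below zero simply means the 99.9% truth-sensitivity tranche reaches deep into the low-confidence tail. Reading VQSLOD as a log10 odds ratio (the conventional meaning of "LOD"; GATK does not guarantee the scores are calibrated probabilities, so treat this as a relative ranking), the implied probability can be sketched as:

```python
def vqslod_to_prob(lod):
    """Convert a VQSLOD score to the implied probability of being a true
    variant, treating VQSLOD as a log10 odds ratio (an interpretive
    assumption; VQSLOD is best used as a relative ranking, not a
    calibrated probability)."""
    odds = 10.0 ** lod
    return odds / (1.0 + odds)
```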

Created 2015-07-08 20:05:08 | Updated | Tags: vqsr vqslod quality-score

I am using GATK 3.4-0 / VQSR to evaluate variants called in ~300 germline genomes of varying coverage (8x-40x). When I do so, many "variants" pass using even a stringent tranche, despite not having a QUAL value or having AC=0 / AF=0. Additionally, some PASS variants have data fields GT:AD:DP, and others have GT:AD:DP:GQ:PL. Why does this occur?

For instance, VQSR Tranche SNP 99.00 to 99.90: -4.4822 <= x < -0.1962

Example: chr21 1419 . T G . PASS AC=0;AF=0.00;AN=524;BaseQRankSum=-1.733e+00;DP=5194;MQ=31.17;MQ0=0;MQRankSum=0.00;NCC=1;ReadPosRankSum=1.73;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VQSLOD=1.67;culprit=DP GT:AD:DP 0/0:2,0:2 0/0:14,0:14 0/0:25,0:25 0/0:1,0:1 0/0:20,0:20 0/0:36,0:36 ./.:8,2 0/0:4,0:4 0/0:4,0:4 0/0:12,0:12 0/0:38,0:38 0/0:3,0:3 0/0:39,0:39 0/0:39,0:39 0/0:14,0:14 0/0:22,0:22 0/0:4,0:4 0/0:22,0:22 0/0:17,0:17 0/0:25,0:25 0/0:32,0:32 0/0:3,0:3 0/0:32,0:32 0/0:30,0:30 0/0:7,0:7 0/0:76,0:76 0/0:28,0:28 0/0:14,0:14 0/0:36,0:36 0/0:24,0:24 0/0:24,0:24 0/0:17,0:17 0/0:7,0:7 0/0:37,0:37 0/0:23,0:23 0/0:23,0:23 0/0:20,0:20 0/0:4,0:4 0/0:4,0:4 0/0:31,0:31 0/0:3,0:3 0/0:24,0:24 0/0:9,0:9 0/0:13,0:13 0/0:5,0:5 0/0:38,0:38 0/0:5,0:5 0/0:23,0:23 0/0:16,0:16 0/0:11,0:11 0/0:32,0:32 0/0:38,0:38 0/0:56,0:56 0/0:28,0:28 0/0:32,0:32 0/0:46,0:46 0/0:18,0:18 0/0:6,0:6 0/0:29,0:29 0/0:7,0:7 0/0:50,0:50 0/0:14,0:14 0/0:33,0:33 0/0:19,0:19 0/0:33,0:33 0/0:28,0:28 0/0:26,0:26 0/0:6,0:6 0/0:19,0:19 0/0:19,0:19 0/0:26,0:26 0/0:23,0:23 0/0:16,0:16 0/0:26,0:26 0/0:22,0:22 0/0:20,0:20 0/0:18,0:18 0/0:18,0:18 0/0:29,0:29 0/0:1,0:1 0/0:34,0:34 0/0:67,0:67 0/0:22,0:22 0/0:16,0:16 0/0:37,0:37 0/0:19,0:19 0/0:5,0:5 0/0:23,0:23 0/0:26,0:26 0/0:24,0:24 0/0:31,0:31 0/0:27,0:27 0/0:23,0:23 0/0:33,0:33 0/0:6,0:6 0/0:42,0:42 0/0:9,0:9 0/0:7,0:7 0/0:34,0:34 0/0:16,0:16 0/0:28,0:28 0/0:5,0:5 0/0:11,0:11 0/0:25,0:25 0/0:39,0:39 0/0:22,1:23 0/0:28,0:28 0/0:15,0:15 0/0:30,0:30 0/0:19,0:19 0/0:5,0:5 0/0:2,0:2 0/0:2,0:2 0/0:8,0:8 0/0:3,0:3 0/0:16,0:16 0/0:31,0:31 0/0:5,0:5 0/0:26,0:26 0/0:20,0:20 0/0:14,0:14 0/0:23,0:23 0/0:49,0:49 0/0:14,0:14 0/0:23,0:23 0/0:48,0:48 0/0:52,1:53 0/0:5,0:5 0/0:6,0:6 0/0:31,0:31 0/0:44,0:44 0/0:7,0:7 0/0:10,0:10 0/0:18,0:18 0/0:14,1:15 0/0:11,0:11 0/0:16,0:16 0/0:17,0:17 0/0:22,0:22 0/0:24,1:25 0/0:15,0:15 0/0:21,0:21 0/0:11,0:11 0/0:12,0:12 0/0:5,1:6 0/0:51,0:51 0/0:40,0:40 0/0:63,0:63 0/0:49,0:49 0/0:73,0:73 0/0:90,0:90 
0/0:38,0:38 0/0:56,0:56 0/0:43,0:43 0/0:39,0:39 0/0:45,0:45 0/0:41,0:41 0/0:58,0:58 0/0:22,0:22 0/0:25,0:25 0/0:28,0:28 0/0:30,0:30 0/0:17,0:17 0/0:5,0:5 0/0:2,0:2 0/0:6,0:6 0/0:6,0:6 0/0:16,0:16 0/0:10,0:10 0/0:4,0:4 0/0:11,0:11 0/0:5,0:5 0/0:3,0:3 0/0:2,0:2 0/0:2,0:2 0/0:10,0:10 0/0:22,0:22 0/0:15,0:15 0/0:13,0:13 0/0:22,0:22 0/0:10,0:10 0/0:17,0:17 0/0:15,0:15 0/0:14,0:14 0/0:33,0:33 0/0:14,0:14 0/0:22,0:22 0/0:21,0:21 0/0:22,1:23 0/0:20,0:20 0/0:18,0:18 0/0:14,0:14 0/0:29,0:29 0/0:12,0:12 0/0:21,0:21 0/0:13,0:13 0/0:23,0:23 0/0:21,0:21 0/0:20,0:20 0/0:13,0:13 0/0:9,0:9 0/0:13,0:13 0/0:14,0:14 0/0:23,0:23 0/0:19,0:19 0/0:14,0:14 0/0:19,0:19 0/0:22,0:22 0/0:10,0:10 0/0:15,0:15 0/0:26,0:26 0/0:30,0:30 0/0:16,0:16 0/0:9,0:9 0/0:12,0:12 0/0:12,0:12 0/0:11,0:11 0/0:17,0:17 0/0:10,0:10 0/0:12,0:12 0/0:16,0:16 0/0:9,0:9 0/0:9,0:9 0/0:11,0:11 0/0:14,0:14 0/0:3,0:3 0/0:12,0:12 0/0:5,0:5 0/0:4,0:4 0/0:9,0:9 0/0:4,0:4 0/0:19,0:19 0/0:21,0:21 0/0:6,0:6 0/0:7,0:7 0/0:10,0:10 0/0:13,0:13 0/0:11,0:11 0/0:2,0:2 0/0:9,0:9 0/0:9,0:9 0/0:5,0:5 0/0:6,0:6 0/0:32,0:32 0/0:7,0:7 0/0:12,0:12 0/0:8,0:8 0/0:2,0:2 0/0:19,0:19 0/0:9,0:9 0/0:3,0:3 0/0:8,0:8 0/0:19,0:19 0/0:7,0:7 0/0:9,0:9 0/0:14,0:14 0/0:7,0:7 0/0:8,0:8 0/0:5,0:5 0/0:11,0:11 0/0:14,0:14 0/0:8,1:9 0/0:12,0:12

chr21 1432 . C T 1609.66 PASS AC=5;AF=9.542e-03;AN=524;BaseQRankSum=3.28;DP=6194;FS=0.000;GQ_MEAN=60.76;GQ_STDDEV=44.35;InbreedingCoeff=-0.0114;MLEAC=5;MLEAF=9.542e-03;MQ=39.30;MQ0=0;MQRankSum=0.780;NCC=1;QD=13.64;ReadPosRankSum=-5.450e-01;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;SOR=0.627;VQSLOD=0.834;culprit=FS GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,49 0/0:35,0:35:99:0,102,1530 0/0:39,0:39:93:0,93,1395 0/0:7,0:7:18:0,18,270 0/0:24,0:24:72:0,72,637 0/0:40,0:40:99:0,108,1620 0/0:2,0:2:6:0,6,47 0/0:7,0:7:21:0,21,184 0/0:6,0:6:18:0,18,160 0/0:17,1:18:23:0,23,376 0/0:24,0:24:66:0,66,990 0/0:9,0:9:21:0,21,315 0/0:51,0:51:99:0,120,1800 0/0:39,0:39:99:0,108,1620 0/0:9,0:9:27:0,27,240 0/0:18,0:18:48:0,48,720 0/0:7,0:7:18:0,18,270 0/0:24,0:24:69:0,69,1035 0/0:31,0:31:78:0,78,1170 0/1:18,18:.:99:492,0,409 0/0:30,0:30:84:0,84,1260 0/0:16,0:16:48:0,48,437 0/0:25,0:25:66:0,66,990 0/0:27,0:27:63:0,63,945 0/0:11,0:11:33:0,33,284 0/0:82,0:82:99:0,120,1800 0/0:63,0:63:99:0,120,1800 0/1:2,7:.:19:188,0,19 0/0:45,0:45:99:0,120,1800 0/0:35,0:35:90:0,90,1350 0/0:47,0:47:99:0,120,1800 0/0:16,0:16:48:0,48,451 0/0:19,0:19:45:0,45,675 0/0:47,0:47:99:0,120,1800 0/0:30,0:30:72:0,72,1080 0/0:22,0:22:54:0,54,810 0/0:7,0:7:18:0,18,270 0/0:2,0:2:6:0,6,63 0/0:5,0:5:12:0,12,180 0/0:66,0:66:99:0,120,1800 0/0:6,0:6:15:0,15,225 0/0:28,0:28:75:0,75,1125 0/0:13,1:14:11:0,11,313 0/0:12,0:12:33:0,33,495 0/0:16,0:16:45:0,45,675 0/0:58,0:58:99:0,120,1800 0/0:24,0:24:66:0,66,990 0/0:27,0:27:78:0,78,1170 0/0:25,0:25:66:0,66,990 0/0:25,0:25:75:0,75,660 0/0:41,0:41:99:0,114,1710 0/0:24,0:24:60:0,60,900 0/0:48,0:48:99:0,120,1800 0/0:44,0:44:99:0,120,1800 0/0:30,0:30:81:0,81,1215 0/0:48,0:48:99:0,120,1800 0/0:23,0:23:69:0,69,605 0/0:3,0:3:9:0,9,75 0/0:33,0:33:84:0,84,1260 0/0:8,0:8:21:0,21,315 0/0:75,0:75:99:0,120,1800 0/0:31,0:31:87:0,87,1305 0/0:39,0:39:99:0,111,1665 0/0:15,1:16:20:0,20,356 0/0:61,0:61:99:0,120,1800 0/0:28,0:28:75:0,75,1125 0/0:38,0:38:99:0,102,1530 
0/0:11,0:11:30:0,30,450 0/0:15,0:15:45:0,45,449 0/0:24,0:24:72:0,72,633 0/0:29,0:29:84:0,84,1260 0/0:27,0:27:72:0,72,1080 0/0:19,0:19:54:0,54,810 0/0:24,0:24:72:0,72,634 0/0:21,0:21:57:0,57,855 0/0:18,0:18:51:0,51,765 0/0:14,0:14:39:0,39,585 0/0:20,0:20:48:0,48,720 0/0:27,0:27:72:0,72,1080 0/0:2,0:2:3:0,3,45 0/0:32,0:32:81:0,81,1215 0/0:65,0:65:99:0,120,1800 0/0:22,0:22:66:0,66,581 0/0:15,0:15:39:0,39,585 0/0:56,0:56:99:0,120,1800 0/1:13,17:.:99:474,0,273 0/0:8,0:8:21:0,21,315 0/0:23,1:24:60:0,60,580 0/0:29,0:29:75:0,75,1125 0/0:36,0:36:96:0,96,1440 0/0:31,0:31:90:0,90,1350 0/0:29,0:29:78:0,78,1170 0/0:26,0:26:69:0,69,1035 0/0:33,0:33:87:0,87,1305 0/0:13,0:13:30:0,30,450 0/0:64,0:64:99:0,120,1800 0/0:8,0:8:24:0,24,201 0/0:12,0:12:36:0,36,342 0/0:38,0:38:93:0,93,1395 0/0:13,0:13:30:0,30,450 0/0:38,0:38:99:0,105,1575 0/0:6,0:6:15:0,15,225 0/0:6,0:6:18:0,18,141 0/0:26,0:26:75:0,75,1125 0/0:63,0:63:99:0,120,1800 0/0:9,0:9:24:0,24,360 0/0:22,0:22:63:0,63,945 0/0:17,0:17:45:0,45,675 0/0:54,0:54:99:0,120,1800 0/0:18,0:18:45:0,45,675 0/0:9,0:9:24:0,24,360 0/0:2,0:2:6:0,6,63 0/0:5,0:5:12:0,12,180 0/0:11,0:11:30:0,30,450 0/0:4,0:4:9:0,9,135 0/0:11,0:11:30:0,30,450 0/0:33,0:33:87:0,87,1305 0/0:4,0:4:12:0,12,100 0/0:23,0:23:60:0,60,900 0/0:29,0:29:78:0,78,1170 0/0:18,0:18:45:0,45,675 0/1:14,15:.:99:392,0,276 0/0:65,0:65:99:0,120,1800 0/0:31,0:31:78:0,78,1170 0/0:44,0:44:99:0,117,1755 0/0:93,0:93:99:0,120,1800 0/0:75,0:75:99:0,120,1800 0/0:6,0:6:18:0,18,155 0/0:7,0:7:15:0,15,225 0/0:33,0:33:93:0,93,1395 0/0:45,0:45:99:0,120,1800 0/0:17,0:17:45:0,45,675 0/0:14,0:14:42:0,42,377 0/0:28,0:28:78:0,78,1170 0/0:14,0:14:42:0,42,392 0/0:15,0:15:45:0,45,384 0/0:14,0:14:39:0,39,585 0/0:24,0:24:66:0,66,99 0/0:18,0:18:51:0,51,765 0/0:17,0:17:42:0,42,630 0/0:24,0:24:63:0,63,945 0/0:25,0:25:72:0,72,1080 0/0:16,0:16:48:0,48,413 0/0:14,0:14:36:0,36,54 0/0:6,0:6:15:0,15,225 0/0:56,0:56:99:0,120,1800 0/0:47,1:48:99:0,120,1800 0/0:68,0:68:99:0,120,1800 0/0:62,1:63:99:0,120,1800 
0/0:64,0:64:99:0,120,1800 0/0:108,1:109:99:0,120,1800 0/0:60,2:62:99:0,120,1800 0/0:76,0:76:99:0,120,1800 0/0:38,0:38:99:0,108,1620 0/0:45,0:45:99:0,120,1800 0/0:39,1:40:87:0,87,1009 0/0:48,0:48:99:0,120,1800 0/0:33,0:33:90:0,90,1350 0/0:29,0:29:75:0,75,1125 0/0:33,0:33:81:0,81,1215 0/0:29,0:29:87:0,87,759 0/0:28,0:28:69:0,69,1035 0/0:22,0:22:60:0,60,900 0/0:7,0:7:21:0,21,183 0/0:6,0:6:15:0,15,225 0/0:7,0:7:15:0,15,225 0/0:15,0:15:36:0,36,540 0/0:9,1:10:4:0,4,195 0/0:7,0:7:21:0,21,175 0/0:11,0:11:21:0,21,315 0/0:4,0:4:9:0,9,135 0/0:9,0:9:27:0,27,239 0/0:5,0:5:12:0,12,180 0/0:7,0:7:21:0,21,197 0/0:7,0:7:18:0,18,270 0/0:13,0:13:30:0,30,450 0/0:18,0:18:51:0,51,765 0/0:18,0:18:51:0,51,765 0/0:17,0:17:51:0,51,442 0/0:21,0:21:54:0,54,810 0/0:19,0:19:51:0,51,765 0/0:22,0:22:57:0,57,855 0/0:16,0:16:45:0,45,675 0/0:22,0:22:66:0,66,594 0/0:30,0:30:75:0,75,1125 0/0:23,0:23:63:0,63,945 0/0:8,0:8:24:0,24,237 0/0:21,0:21:57:0,57,855 0/0:14,0:14:42:0,42,389 0/0:14,0:14:39:0,39,585 0/0:13,0:13:36:0,36,540 0/0:15,0:15:45:0,45,371 0/0:11,0:11:33:0,33,266 0/0:24,0:24:69:0,69,1035 0/0:21,0:21:57:0,57,855 0/0:16,0:16:42:0,42,630 0/0:22,0:22:66:0,66,579 0/0:19,0:19:54:0,54,810 0/0:19,0:19:54:0,54,810 0/0:22,0:22:63:0,63,945 0/0:17,0:17:45:0,45,675 0/0:20,0:20:60:0,60,534 0/0:12,0:12:36:0,36,323 0/0:16,0:16:42:0,42,630 0/0:18,0:18:45:0,45,675 0/0:20,0:20:54:0,54,810 0/0:23,0:23:63:0,63,945 0/0:7,0:7:21:0,21,188 0/0:18,0:18:51:0,51,765 0/0:9,0:9:27:0,27,241 0/0:16,0:16:48:0,48,439 0/0:21,0:21:51:0,51,765 0/0:13,0:13:30:0,30,450 0/0:11,0:11:33:0,33,301 0/0:15,0:15:42:0,42,630 0/0:5,0:5:12:0,12,180 0/0:10,0:10:27:0,27,405 0/0:22,0:22:57:0,57,855 0/0:10,0:10:30:0,30,268 0/0:9,0:9:21:0,21,315 0/0:18,0:18:42:0,42,630 0/0:9,0:9:24:0,24,360 0/0:10,0:10:27:0,27,405 0/0:12,0:12:33:0,33,495 0/0:21,0:21:57:0,57,855 0/0:7,0:7:18:0,18,270 0/0:23,0:23:60:0,60,900 0/0:16,0:16:48:0,48,433 0/0:18,0:18:48:0,48,720 0/0:9,0:9:27:0,27,241 0/0:11,0:11:33:0,33,269 0/0:22,0:22:57:0,57,855 0/0:22,0:22:60:0,60,900 
0/1:7,7:.:99:185,0,123 0/0:15,0:15:42:0,42,630 0/0:16,0:16:39:0,39,585 0/0:21,0:21:54:0,54,810 0/0:13,0:13:36:0,36,540 0/0:14,0:14:39:0,39,585 0/0:14,0:14:42:0,42,380 0/0:12,0:12:33:0,33,495 0/0:8,0:8:24:0,24,199 0/0:10,0:10:27:0,27,405 0/0:36,0:36:93:0,93,1395 0/0:8,0:8:24:0,24,170 0/0:10,0:10:30:0,30,248 0/0:13,0:13:39:0,39,305 ./.:0,0:0 0/0:24,0:24:69:0,69,1035 0/0:17,0:17:42:0,42,630 0/0:12,0:12:36:0,36,298 0/0:32,0:32:90:0,90,1350 0/0:22,0:22:66:0,66,535 0/0:9,0:9:27:0,27,216 0/0:17,0:17:42:0,42,630 0/0:17,0:17:39:0,39,585 0/0:11,0:11:30:0,30,450 0/0:18,0:18:54:0,54,417 0/0:11,0:11:27:0,27,405 0/0:17,1:18:40:0,40,404 0/0:22,0:22:63:0,63,945 0/0:6,0:6:18:0,18,162 0/0:14,1:15:34:0,34,351
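Records like the first example above can be screened out downstream: a site with a missing QUAL (`.`) or AC=0 carries no called non-reference allele even if VQSR marked it PASS. A Python sketch (`is_reportable_variant` is a hypothetical helper; whether to drop such sites is a project-level decision, not a GATK rule):

```python
def is_reportable_variant(vcf_fields):
    """Return False for records with a missing QUAL ('.') or AC=0, i.e.
    sites where no non-reference allele was actually called, like the
    first PASS record in the example above. vcf_fields is a VCF record
    already split on tabs."""
    if vcf_fields[5] == ".":
        return False
    info = dict(kv.split("=", 1) for kv in vcf_fields[7].split(";") if "=" in kv)
    ac = info.get("AC", "0")
    # AC is comma-separated for multiallelic sites.
    return any(int(a) > 0 for a in ac.split(","))
```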

Created 2015-07-07 07:35:22 | Updated | Tags: vqsr vcf

Hi all!

I've got a questions concerning the VQSR.

The situation is as follows:

• I've got more than 100 single-sample VCFs
• Unfortunately I won't be able to re-call the VCFs
• Merging the files into a single multi-sample VCF is, in my opinion, a bad idea due to the loss of the information stored in the INFO field
• Creating multi-sample VCFs with the help of 1000G would require re-calling or merging, so this is also not an option.

Therefore, more or less just to see what happens, I specified multiple inputs for the VariantRecalibrator walker and was able to produce a recal and tranches file. However, it's probably still a bad idea to use the recal file for recalibration, since there are now multiple entries for the same variant (most likely because the same variant appears in multiple single-sample VCFs?):

```
chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=1.9214;culprit=MQRankSum
chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=2.0305;culprit=MQ
```

I guess during ApplyRecalibration it's not possible to decide which entry is the correct one for a given variant in single-sample VCF X1. However, this would be crucial, since the entries show different VQSLOD values.

So in my opinion it's probably not possible to use VQSR in my specific case. However, since I really would like to use it, I thought maybe one of you knows a way to use it despite all these problems.
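If one did want to salvage such a recal file, the duplicate entries could be collapsed by keeping the highest-VQSLOD record per site. A Python sketch of that workaround (hypothetical helper name; this is not an endorsement of running VQSR on merged single-sample call sets):

```python
def best_recal_entries(recal_lines):
    """Collapse duplicate .recal entries for the same (CHROM, POS) by
    keeping the entry with the highest VQSLOD. A workaround sketch only;
    picking the max is an arbitrary tie-breaking choice."""
    best = {}
    for line in recal_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Locate the INFO-style field containing VQSLOD, wherever it sits.
        info = next(x for x in fields if "VQSLOD=" in x)
        lod = float(dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)["VQSLOD"])
        key = (fields[0], fields[1])
        if key not in best or lod > best[key][0]:
            best[key] = (lod, line)
    return [line for _, line in best.values()]
```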

Thanks a lot!

Created 2015-06-19 02:29:09 | Updated | Tags: vqsr

Hi,

I would like to apply VQSR to an exome dataset. I wonder whether "-L" should be included in the command line too, and why?

Thank you very much!

Emma

Created 2015-05-26 01:55:14 | Updated | Tags: vqsr vqsr-exome

Hello,

I want to run VQSR for my exome data. I have finished data pre-processing and joint genotyping. Now I want to move to the next step, which is VQSR (starting with VariantRecalibrator). I noticed that I don't have the training sets for the tool's resource parameters. Where can I download the VCF files that I need to run this command? This is what I've tried:

HapMap (Link): I tried to open the HapMap website and found a download link for allocated SNPs. Is this the file I need? The files are in XML format, split per chromosome, and based on hg35. Do I need to join these XML files and then convert them to VCF using GATK? Will the different genome build cause problems (I use hg38)?

1000 Genomes (Link): I found the VCF files, also per chromosome. I think I just need to join them, don't I? Could you give some suggestions on how to join them properly?

Omni: I don't know where I can get this file.

dbSNP: I think I already have this file. I have used it during the GATK pre-processing step. It is the same file, right?

Thank you for your help.

Created 2015-05-22 18:56:41 | Updated | Tags: vqsr

Hi,

I encounter an error when I try to run the variant recalibration step over 50 WGS samples. The error message comes up some time after the program starts to run, but every time it stops at a different chromosomal location. It is really confusing, since the same command worked fine for me previously.

Hope to get a clue from you.

Following is the error message.

##### ERROR stack trace

java.lang.RuntimeException: java.io.IOException: Transport endpoint is not connected
        at htsjdk.tribble.readers.LineReaderUtil$2.readLine(LineReaderUtil.java:79)
        at htsjdk.tribble.readers.LineIteratorImpl.advance(LineIteratorImpl.java:23)
        at htsjdk.tribble.readers.LineIteratorImpl.advance(LineIteratorImpl.java:10)
        at htsjdk.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:79)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:41)
        at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.readNextRecord(TribbleIndexedFeatureReader.java:449)
        at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:405)
        at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:373)
        at org.broadinstitute.gatk.utils.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:60)
        at org.broadinstitute.gatk.utils.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:42)
        at org.broadinstitute.gatk.utils.iterators.PushbackIterator.next(PushbackIterator.java:65)
        at org.broadinstitute.gatk.utils.iterators.PushbackIterator.element(PushbackIterator.java:51)
        at org.broadinstitute.gatk.utils.refdata.SeekableRODIterator.next(SeekableRODIterator.java:223)
        at org.broadinstitute.gatk.utils.refdata.SeekableRODIterator.next(SeekableRODIterator.java:66)
        at org.broadinstitute.gatk.utils.collections.RODMergingIterator$Element.next(RODMergingIterator.java:72)
        at org.broadinstitute.gatk.utils.collections.RODMergingIterator.next(RODMergingIterator.java:111)
        at org.broadinstitute.gatk.utils.collections.RODMergingIterator.allElementsLTE(RODMergingIterator.java:145)
        at org.broadinstitute.gatk.utils.collections.RODMergingIterator.allElementsLTE(RODMergingIterator.java:129)
        at org.broadinstitute.gatk.engine.datasources.providers.RodLocusView.getSpanningTracks(RodLocusView.java:140)
        at org.broadinstitute.gatk.engine.datasources.providers.RodLocusView.next(RodLocusView.java:127)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:172)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:153)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:271)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Transport endpoint is not connected
        at java.io.RandomAccessFile.readBytes0(Native Method)
        at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
        at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
        at htsjdk.samtools.seekablestream.SeekableFileStream.read(SeekableFileStream.java:80)
        at htsjdk.tribble.TribbleIndexedFeatureReader$BlockStreamWrapper.read(TribbleIndexedFeatureReader.java:539)
        at java.io.InputStream.read(InputStream.java:101)
        at htsjdk.tribble.readers.PositionalBufferedStream.fill(PositionalBufferedStream.java:127)
        at htsjdk.tribble.readers.PositionalBufferedStream.read(PositionalBufferedStream.java:79)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at htsjdk.tribble.readers.LongLineBufferedReader.fill(LongLineBufferedReader.java:140)
        at htsjdk.tribble.readers.LongLineBufferedReader.readLine(LongLineBufferedReader.java:298)
        at htsjdk.tribble.readers.LongLineBufferedReader.readLine(LongLineBufferedReader.java:354)
        at htsjdk.tribble.readers.LineReaderUtil$2.readLine(LineReaderUtil.java:77)
        ... 32 more

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-05-17 19:30:08 | Updated 2015-05-17 20:28:52 | Tags: vqsr runtime-error

I'm currently doing a comparison between 100 Greek samples downsampled to 30x and 15x to explore the effects this has on our various tools. I'm only evaluating chromosome 6 as I need the initial comparison results soon, and something went boom. Curiously enough it only affects the 15x version of the data and not the 30x. I suspect it might be something threading related, so I'm going to retry with fewer and then no threads. Confirmed the same error with 31 threads; now testing in single-threaded mode.

INFO  17:29:10,799 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41
INFO  17:29:10,800 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:29:10,800 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:29:10,806 HelpFormatter - Program Args: -T VariantRecalibrator -nt 32 -R /lustre/scratch113/resources/ref/Homo_sapiens/1000Genomes_hs37d5/hs37d5.fa -input greek_bams/15x/15x_annot.vcf.gz --recal_file greek_bams/15x_vqsr_snp_recal.vcf.gz --tranches_file greek_bams/15x_vqsr_snp_recal.tranches -mode SNP -rscriptFile greek_bams/15x.snp.plot -L 6 -l INFO -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /lustre/scratch111/resources/variation/Homo_sapiens/grch37/gatk-bundle/2.5/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /lustre/scratch111/resources/variation/Homo_sapiens/grch37/gatk-bundle/2.5/1000G_omni2.5.b37.vcf -resource:1000g,known=false,training=true,truth=false,prior=10.0 /lustre/scratch111/resources/variation/Homo_sapiens/grch37/gatk-bundle/2.5/1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /lustre/scratch111/resources/variation/Homo_sapiens/grch37/gatk-bundle/2.8/b37//dbsnp_138.b37.vcf --target_titv 2.15 -an QD -an MQRankSum -an ReadPosRankSum -an FS -an InbreedingCoeff -an DP -an MQ -an SOR
INFO  17:29:10,811 HelpFormatter - Executing as mercury@hgs4b on Linux 3.8.0-44-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15.
INFO  17:29:10,811 HelpFormatter - Date/Time: 2015/05/17 17:29:10
INFO  17:29:10,812 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:29:10,812 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:29:11,493 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:29:12,058 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:29:13,568 IntervalUtils - Processing 171115067 bp from intervals
WARN  17:29:13,570 IndexDictionaryUtils - Track input doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  17:29:13,620 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 32 processors available on this machine
INFO  17:29:13,758 GenomeAnalysisEngine - Preparing for traversal
INFO  17:29:13,765 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:29:13,766 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:29:13,767 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  17:29:13,768 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  17:29:13,878 TrainingSet - Found hapmap track:    Known = false   Training = true         Truth = true    Prior = Q15.0
INFO  17:29:13,880 TrainingSet - Found omni track:      Known = false   Training = true         Truth = true    Prior = Q12.0
INFO  17:29:13,882 TrainingSet - Found 1000g track:     Known = false   Training = true         Truth = false   Prior = Q10.0
INFO  17:29:13,884 TrainingSet - Found dbsnp track:     Known = true    Training = false        Truth = false   Prior = Q2.0
INFO  17:30:00,834 ProgressMeter -      6:45767321    437766.0    47.0 s     107.0 s       26.7%     2.9 m       2.1 m
INFO  17:30:17,026 VariantDataManager - QD:      mean = 19.92    standard deviation = 5.86
INFO  17:30:17,226 VariantDataManager - MQRankSum:       mean = 0.06     standard deviation = 0.52
INFO  17:30:17,410 VariantDataManager - ReadPosRankSum:          mean = 0.25     standard deviation = 0.52
INFO  17:30:17,601 VariantDataManager - FS:      mean = 2.65     standard deviation = 3.93
INFO  17:30:17,790 VariantDataManager - InbreedingCoeff:         mean = -0.00    standard deviation = 0.19
INFO  17:30:17,985 VariantDataManager - DP:      mean = 1432.94  standard deviation = 185.41
INFO  17:30:18,179 VariantDataManager - MQ:      mean = 59.94    standard deviation = 0.72
INFO  17:30:18,374 VariantDataManager - SOR:     mean = 0.78     standard deviation = 0.40
INFO  17:30:19,569 VariantDataManager - Annotations are now ordered by their information content: [DP, MQ, QD, FS, ReadPosRankSum, MQRankSum, SOR, InbreedingCoeff]
INFO  17:30:19,642 VariantDataManager - Training with 611167 variants after standard deviation thresholding.
INFO  17:30:19,648 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO  17:30:30,839 ProgressMeter -     6:171052865   4569513.0    77.0 s      16.0 s      100.0%    77.0 s       0.0 s
INFO  17:31:00,843 ProgressMeter -     6:171052865   4569513.0   107.0 s      23.0 s      100.0%   107.0 s       0.0 s
INFO  17:31:30,847 ProgressMeter -     6:171052865   4569513.0     2.3 m      29.0 s      100.0%     2.3 m       0.0 s
INFO  17:31:35,265 VariantRecalibratorEngine - Finished iteration 0.
INFO  17:32:00,850 ProgressMeter -     6:171052865   4569513.0     2.8 m      36.0 s      100.0%     2.8 m       0.0 s
INFO  17:32:25,369 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 1.82124

...

INFO  17:45:45,833 VariantRecalibratorEngine - Finished iteration 95.   Current change in mixture coefficients = 0.00236
INFO  17:46:00,990 ProgressMeter -     6:171052865   4569513.0    16.8 m       3.7 m      100.0%    16.8 m       0.0 s
INFO  17:46:12,074 VariantRecalibratorEngine - Convergence after 98 iterations!
INFO  17:46:17,393 VariantRecalibratorEngine - Evaluating full set of 985716 variants...
INFO  17:46:17,455 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO  17:46:27,147 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result
Caused by: java.lang.IllegalArgumentException: No data found.
... 5 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR
##### ERROR MESSAGE: Unable to retrieve result
##### ERROR ------------------------------------------------------------------------------------------


Created 2015-05-08 08:02:41 | Updated | Tags: vqsr

Hi, I have several variant files that were generated by other calling tools, with some annotations not defined by GATK. I wonder if I can also apply VQSR on these datasets, restricting to those exclusive annotations (using "-an"). Is there any drawback to this application?

Besides, I saw somewhere a suggestion to fit only a single Gaussian distribution to each annotation. Is this a proper way to perform variant recalibration?

Created 2015-05-06 07:26:13 | Updated | Tags: vqsr

Hi,

I saw "LowQual" in the QUAL field of some records in VCF files which have been through VQSR. I wonder whether these records were ever used in the VQSR. If they were used in that step, but without being tagged with something like "VQSRTrancheSNP99.90to100.00", does it mean that they are somehow OK in terms of VQSR?

Thank you!

Emma

Created 2015-04-20 23:41:13 | Updated | Tags: vqsr vcf gatk

Hi,

After using VQSR, I have a vcf output that contains sites labeled "." in the FILTER field. When I look at the vcf documentation (1000 genomes), it says that those are sites where filters have not been applied. Is this correct? I would like to know more about what these sites mean, exactly.

An example of such a site in my data is:

1 10439 . AC A 4816.02 . AC=10;AF=0.185;AN=54;BaseQRankSum=-4.200e-01;ClippingRankSum=-2.700e-01;DP=1690;FS=6.585;GQ_MEAN=111.04;GQ_STDDEV=147.63;InbreedingCoeff=-0.4596;MLEAC=17;MLEAF=0.315;MQ=36.85;MQ0=0;MQRankSum=-8.340e-01;NCC=0;QD=11.39;ReadPosRankSum=-8.690e-01;SOR=1.226 GT:AD:DP:GQ:PGT:PID:PL 0/1:22,14:36:99:0|1:10439_AC_A:200,0,825 0/0:49,0:49:0:.:.:0,0,533 0/0:92,0:92:0:.:.:0,0,2037 0/1:20,29:49:99:.:.:634,0,340 0/0:11,0:16:32:.:.:0,32,379 0/1:21,17:38:99:.:.:273,0,616 0/0:57,0:57:0:.:.:0,0,1028 0/0:58,0:58:0:.:.:0,0,1204 0/0:52,0:52:0:.:.:0,0,474 0/0:86,0:86:27:.:.:0,27,2537 0/1:13,24:37:99:.:.:596,0,220 0/1:14,34:48:99:.:.:814,0,263 0/0:86,0:86:0:.:.:0,0,865 0/0:61,0:61:0:.:.:0,0,973 0/0:50,0:50:0:.:.:0,0,648 0/0:40,0:40:0:.:.:0,0,666 0/0:79,0:79:0:.:.:0,0,935 0/0:84,0:84:0:.:.:0,0,1252 0/1:22,27:49:99:.:.:618,0,453 0/0:39,0:39:0:.:.:0,0,749 0/0:74,0:74:0:.:.:0,0,1312 0/1:13,18:31:99:.:.:402,0,281 0/0:41,0:44:99:.:.:0,115,1412 0/1:30,9:39:99:.:.:176,0,475 0/1:26,23:49:99:.:.:433,0,550 0/1:13,34:47:99:.:.:736,0,185 0/0:44,0:44:0:.:.:0,0,966

Thanks, Alva

Created 2015-04-01 14:17:52 | Updated | Tags: vqsr haplotypecaller best-practices gvcf

I am currently processing ~100 exomes and following the Best Practice recommendations for Pre-processing and Variant Discovery. However, there are a couple of gaps in the documentation, as far as I can tell, regarding exactly how to proceed with VQSR with exome data. I would be grateful for some feedback, particularly regarding VQSR. The issues are similar to those discussed on this thread: http://gatkforums.broadinstitute.org/discussion/4798/vqsr-using-capture-and-padding but my questions aren't fully-addressed there (or elsewhere on the Forum as far as I can see).

Prior Steps:
1) All samples processed with same protocol (~60Mb capture kit) - coverage ~50X-100X
2) Alignment with BWA-MEM (to whole genome)
3) Remove duplicates, indel-realignment, BQSR
4) HC to produce gVCFs (-ERC)
5) Genotype gVCFs

This week I have been investigating VQSR, which has generated some questions.

Q1) Which regions should I use from my data for building the VQSR model?

Here I have tried 3 different input datasets:

a) All my variant positions (11 million positions)
b) Variant positions that are in the capture kit (~326k positions) - i.e. used bedtools intersect to only extract variants from (a)
c) Variant positions that are in the capture kit with padding of 100nt either side (~568k positions) - as above but bed has +/-100 on regions + uniq to remove duplicate variants that are now in more than one bed region

For each of the above, I have produced "sensitive" and "specific" datasets:
"Specific": --ts_filter_level 90.0 for both SNPs and INDELs
"Sensitive": --ts_filter_level 99.5 for SNPs, and --ts_filter_level 99.0 for INDELs (as suggested in the definitive FAQ https://www.broadinstitute.org/gatk/guide/article?id=1259)

I also wanted to see what effect, if any, the "-tranche" argument has - i.e. does it just allow for ease of filtering, or does it affect the model generated, since it was not clear to me. I applied either 5 tranches or 6:

5-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 for both SNPs and INDELs
6-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 for both SNPs and INDELs

To compare the results I then used bed intersect to get back to the variants that are within the capture kit (~326k, as before). The output is shown in the spreadsheet image below.

What the table appears to show me, is that at the "sensitive" settings (orange background), the results are largely the same - the difference between "PASS" in the set at the bottom where all variants were being used, and the others is mostly accounted for by variants being pushed into the 99.9-100 tranche.

However, when trying to be specific (blue background), the difference between using all variants, or just the capture region/capture+100, is marked. Also surprising (at least for me) is the huge difference in "PASS" in cells E15 and E16, where the only difference was the number of tranches given to the model (note that there is very little difference in the analogous cells in Rows 5/6 and Rows 10/11).

Q2) Can somebody explain why there is such a difference in "PASS" rows between All-SPEC and the Capture(s)-Spec?
Q3) Can somebody explain why 6 tranches resulted in ~23k more PASSes than 5 tranches for the All-SPEC?
Q4) What does "PASS" mean in this context - a score = 100? Is it an observation of a variant position in my data that has been observed in the "truth" set? It isn't actually described in the header of the VCF, though presumably the following corresponds: FILTER=
Q5) Similarly, why do no variants fall below my lower tranche threshold of 90? Is it because they are all reliable at least to this level?

Q6) Am I just really confused? :-(

Thanks in advance for your help! :-)

Created 2015-03-11 20:38:44 | Updated | Tags: vqsr best-practices variantfiltration variant-calling

Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 other samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but are not listed in the VCF. I discovered that the variant is being filtered out, the reason being VQSRTrancheSNP99.00to99.90. The genotype is homozygous variant, DP is 243, QUAL is 524742.54 and it's known in dbSNP. I suspect this is happening to other variants.

How do I adjust VQSR, or how tranches are used and variants get placed in them? I suppose I need to fine-tune my parameters... but I would think something as obvious as this variant would pass filtering.

Created 2015-02-25 02:12:39 | Updated | Tags: vqsr baserecalibrator haplotypecaller knownsites resources variant-recalibration

Hi, I have a general question about the importance of known VCFs (for BQSR and HC) and resources file (for VQSR). I am working on rice for which the only known sites are the dbSNP VCF files which are built on a genomic version older than the reference genomic fasta file which I am using as basis. How does it affect the quality/accuracy of variants? How important is to have the exact same build of the genome as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use the version which is built on a different version of the genome for the same species? In other words, which steps (BQSR, HC, VQSR etc) can be performed without the known sites/resource file? If the answers to the above questions are too detailed, can you please point me to any document, if available, which might address this issue?

Thanks, NB

Created 2015-02-12 16:44:10 | Updated | Tags: vqsr

when I finished VQSR, I got a vcf file "recalibrated_variants.vcf",

[wubin]$ awk -F"\t" 'NR>161{print $7}' recalibrated_variants.vcf | sort | uniq -c
  65902 LowQual
3163999 PASS
 122377 VQSRTrancheINDEL90.00to99.00
  53509 VQSRTrancheINDEL99.00to99.90
   4589 VQSRTrancheINDEL99.90to100.00
 742359 VQSRTrancheSNP90.00to99.00
 368105 VQSRTrancheSNP99.00to99.90
 184493 VQSRTrancheSNP99.90to100.00

If I want 99% truth sites sensitivity, I can discard sites of

VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.90to100.00
VQSRTrancheSNP99.00to99.90
VQSRTrancheSNP99.90to100.00
LowQual

and retain sites of

PASS
VQSRTrancheINDEL90.00to99.00

Am I right?
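For what it's worth, the retain/discard split described above can be applied with a one-line filter on the FILTER column. A minimal sketch, using the file name from the post (note that under a 99% truth-sensitivity cut the VQSRTrancheSNP90.00to99.00 sites would also be retained, alongside the INDEL tranche listed above):

```shell
# Sketch only: keep header lines, PASS, and both 90.00to99.00 tranches;
# drop LowQual and every tranche at or above the 99.00 cut.
awk -F'\t' '/^#/ || $7 == "PASS" || $7 ~ /^VQSRTranche(SNP|INDEL)90\.00to99\.00$/' \
    recalibrated_variants.vcf > recalibrated_variants.ts99.vcf
```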

Created 2015-02-11 22:56:25 | Updated | Tags: vqsr variantannotator

Hi,

In the best practices for vqsr in indel mode it is recommended to use the annotation SOR. However, when I try to add this annotation using VariantAnnotator it only adds it to the SNP calls not the indel calls. Does this mean SOR should not be used for vqsr in indel mode?

Thanks,

Kath

Created 2015-02-04 05:14:44 | Updated | Tags: variantrecalibrator vqsr vcf gatk

Hi,

I have generated vcf files using GenotypeGVCFs; each file contains variants corresponding to a different chromosome. I would like to use VQSR to perform the recalibration on all these data combined (for maximum power), but it seems that VQSR only takes a single vcf file, so I would have to combine my vcf files using CombineVariants. Looking at the documentation for CombineVariants, it seems that this tool always produces a union of vcfs. Since each vcf file is chromosome-specific, there are no identical sites across files. My questions are: Is CombineVariants indeed the appropriate tool for me to merge chromosome-specific vcf files, and is there any additional information that I should specify in the command-line when doing this? Do I need to run VariantAnnotator afterwards (I would assume not, since these vcfs were generated using GenotypeGVCFs and the best practices workflow more generally)? I just want to be completely sure that I am proceeding correctly.

Thank you very much in advance, Alva
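As context for the question above: GATK 3's CombineVariants can be used to concatenate per-chromosome VCFs of the same cohort. A hedged sketch, with placeholder paths; to my understanding `--assumeIdenticalSamples` is the option that applies when every input carries the identical sample set, but check the tool documentation for your GATK version:

```shell
# Sketch only - jar path, reference, and VCF names are placeholders.
java -jar GenomeAnalysisTK.jar \
    -T CombineVariants \
    -R reference.fasta \
    --variant chr1.vcf \
    --variant chr2.vcf \
    --assumeIdenticalSamples \
    -o combined.vcf
```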

Created 2015-02-02 21:24:31 | Updated | Tags: vqsr dbsnp vqslod genotypegvcfs gvcf

From my whole-genome (human) BAM files, I want to obtain: For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs:

CHROM POS ID REF ALT
chr1  1   .  A   <NON_REF>   # my data
chr1  1   .  A   T           # dbSNP

I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T, or T/T. The call of <NON_REF> isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases:

1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2.
3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCF removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR.

Consequently, this seems to distinguish only between these two cases:

1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Douglas

Created 2015-01-27 20:25:42 | Updated | Tags: variantrecalibrator vqsr vcf gatk

Hi,

I ran VariantRecalibrator and ApplyRecalibration, and everything seems to have worked fine. I just have one question: if there are no reference alleles besides "N" in my recalibrate_SNP.recal and recalibrate_INDEL.recal files, and the "alt" field simply displays <VQSR>, does that mean that none of my variants were recalibrated? Just wanted to be completely sure. My original file (after running GenotypeGVCFs) has the same number of variants as the recalibrated vcf's.

Thanks, Alva

Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs variant-calling

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Thanks very much in advance,

Alva

Created 2015-01-10 08:13:41 | Updated | Tags: unifiedgenotyper variantrecalibrator vqsr haplotypescore annotation

The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired

I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.

Created 2014-12-02 23:21:07 | Updated | Tags: vqsr known-vcf

Hello, I am working on dog targeted sequencing data. In the VQSR step, I got the error below. For the record, I use canFam3.fa (from UCSC) as reference and Canis_familiaris.newchr.vcf (Ensembl) as resource file; the two files didn't cause errors in previous steps. Did anyone have a similar problem? Thanks for tips!

INFO 18:07:03,352 HelpFormatter - --------------------------------------------------------------------------------
INFO 18:07:03,355 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 18:07:03,355 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 18:07:03,355 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 18:07:03,359 HelpFormatter - Program Args: -T VariantRecalibrator -R canFam3.fa -input ./variant_calling/FGC0805.target.raw.snps.indels.vcf -resource:dbsnp,known=false,training=true,truth=false,prior=12.0 Canis_familiaris.newchr.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile ./variant_calling/FGC0805.target.recalibrate.SNP.recal -tranchesFile ./variant_calling/FGC0805.target.recalibrate.SNP.tranches -rscriptFile ./variant_calling/FGC0805.target.recalibrate.SNP.plots.R
INFO 18:07:03,363 HelpFormatter - Executing as wangfan1@bioapps on Linux 2.6.32-358.14.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_65-b17.
INFO 18:07:03,364 HelpFormatter - Date/Time: 2014/12/02 18:07:03
INFO 18:07:03,364 HelpFormatter - --------------------------------------------------------------------------------
INFO 18:07:03,364 HelpFormatter - --------------------------------------------------------------------------------
INFO 18:07:04,430 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:07:05,183 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 18:07:06,187 GenomeAnalysisEngine - Preparing for traversal
INFO 18:07:06,217 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:07:06,218 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:07:06,219 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:07:06,219 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 18:07:06,231 TrainingSet - Found dbsnp track: Known = false Training = true Truth = false Prior = Q12.0
INFO 18:07:36,226 ProgressMeter - Starting 0.0 30.0 s 49.6 w 100.0% 30.0 s 0.0 s

##### ERROR ------------------------------------------------------------------------------------------

Created 2014-10-29 20:01:04 | Updated | Tags: vqsr random-forest

Hi there,

I hope I'm not being too forward here, but I was wondering if your group was still looking into implementing a RF model for VQSR (in particular I was hoping that it would help with smaller size datasets, in terms of the count of variant sites for smaller than exome captures) or if you have abandoned it?

Best Regards,

Kurt

Created 2014-10-29 02:38:38 | Updated | Tags: vqsr

I'm trying to run VQSR on a vcf I just called with HaplotypeCaller. Here is my command:

java -Xmx32g -jar /Commands/GATK/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /Reference/ucsc.hg19.fasta \
    -input H3H5.HTC.raw.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /Reference/hapmap3.3.hg19.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 /Reference/1000G.omni2.5.hg19.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /Reference/1000G.ph1.SNP.HC.hg19.vcf \
    -resource:dbsnp,known=true,training=true,truth=false,prior=6.0 /Reference/dbsnp138.hg19.vcf \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile VQSR/H3H5.SNP.VQSR \
    -tranchesFile VQSR/H3H5.SNP.Tranches \
    -rscriptFile VQSR/H3H5.SNP.VQSR.R \
    -nt 16

Each time I try to run VQSR it gives me this error:

INFO 12:23:34,122 VariantRecalibratorEngine - Finished iteration 95. Current change in mixture coefficients = 0.00198
INFO 12:23:34,122 VariantRecalibratorEngine - Convergence after 95 iterations!
INFO 12:23:34,461 VariantRecalibratorEngine - Evaluating full set of 251205 variants...
INFO 12:23:34,476 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 12:23:39,194 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

In the discussion of similar errors, I've seen that too little data and MQ annotation can cause similar problems, but I didn't use either of them.

I'm going to guess that it is something simple, but any help would be appreciated.

Chris

Created 2014-09-29 11:03:13 | Updated | Tags: variantrecalibrator vqsr

Hi GATK team,

our lab has a never ending discussion about running VQSR on related samples or having to exclude them. And i guess we need your help to settle this.

We have a multisample call (UG) run on ~1,500 samples, which contains all sorts of unrelated samples, trios and small families. Our statistician tries to convince us to exclude all related samples, because they might skew the VQSR model. The biologists don't follow this argument, but we are unable to convince each other. Do related samples disturb the VQSR?

Even more specific - if we run VQSR on tumor/normal pairs - should we expect surprising behaviour of the model or can we just run the recalibration without worries?

thanks for your help in advance, Oliver

Created 2014-09-18 12:12:57 | Updated | Tags: vqsr

Hello Geraldine,

First, thank you a lot for your amazing work on this forum. My project deals with discovering rare population-specific variants in human exomes, and I would like to know how the VQSR step would affect the discovery of these variants. I was wondering whether it is better to perform VQSR on all the populations together (420 individuals, but with a risk of cleaning out "true" rare population-specific variants) or to run it per population (between 30 and 100 individuals each, but I read that VQSR loses power with a reduced number of samples)?

Thank you for your help, Best Marie

Created 2014-09-11 15:52:18 | Updated | Tags: vqsr haplotypecaller qualbydepth genotypegvcfs

Hey there,

How is it possible that some of my SNP or indel calls are missing the QD tag? I'm following the recommended workflow and I've tested it for both RNAseq (hard-filter complaints - that's how I saw those tags were missing) and exome sequencing (VQSR). How can a hard filter applied on QD be considered passed by a line that doesn't actually have that tag? I'm seeing a lot more INDELs in RNAseq where this kind of scenario is happening as well.

Here's the command lines that I used :

# VQSR

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar \
    -l INFO -T VariantRecalibrator \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
    -mode SNP \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/References/hg19/VQSR/1000G_phase1.snps.high_confidence.b37.noGL.converted.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/References/hg19/VQSR/hapmap_3.3.b37.noGL.nochrM.converted.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 ~/References/hg19/VQSR/1000G_omni2.5.b37.noGL.nochrM.converted.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ~/References/hg19/VQSR/dbsnp_138.b37.excluding_sites_after_129.noGL.nochrM.converted.vcf \
    -input snv.vcf \
    -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED \
    -tranchesFile 96Exomes.HC.TruSeq.snv.tranches \
    -rscriptFile 96Exomes.HC.TruSeq.snv.plots.R \
    -R ~/References/hg19/hg19.fasta \
    --maxGaussians 4

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar \
    -l INFO -T ApplyRecalibration \
    -ts_filter_level 99.0 \
    -mode SNP \
    -input snv.vcf \
    -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED \
    -tranchesFile 96Exomes.HC.TruSeq.snv.tranches \
    -o 96Exomes.HC.TruSeq.snv.recal.vcf \
    -R ~/References/hg19/hg19.fasta

# HARD FILTER (RNASeq)

java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf Here are some examples for RNAseq : chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 
0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56 chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 
For Exome Seq now : chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 
0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395

Thanks for your help.

Created 2014-09-10 08:21:59 | Updated 2014-09-10 08:28:15 | Tags: vqsr vqsr-exome

Hi, I am working on non-human species data and I have used VQSR in the analysis pipeline as shown below. If VQSR is performed, should we still consider filtering the variants on base quality and mapping quality?

Created 2014-09-02 12:38:46 | Updated 2014-09-02 12:48:19 | Tags: vqsr bqsr gatk

Hi there, I have been using GATK to identify variants recently. I saw that BQSR is highly recommended, but I don't know whether it is still needed for de novo mutation calling. For example, I want to identify de novo mutations generated in progenies by single seed descent methods in plants, as in the paper "The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana"; these spontaneously arising mutations may not be included in the known sites of variants. Based on documentation posted on the GATK website, BQSR assumes that all reference mismatches it sees are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed in the variant calling step. So in this situation, what should I do? Should I skip the BQSR step? Also, what should I do when I reach the VQSR step? Hope some GATK developers can help me with this. Thanks.

Created 2014-07-09 15:31:45 | Updated | Tags: vqsr

Hi, I have exome sequencing data on 90 samples, and my lab uses the VQSR filter to remove low quality variants. I was wondering if I should also perform a genotype-level filter by DP/GQ after this VQSR filtering step. Is there a protocol that is recommended, or some metrics I can look at to determine whether such a step is required?
Thanks, Shweta

Created 2014-07-03 14:29:27 | Updated | Tags: vqsr r

java -jar -Djava.io.tmpdir=temp/ -Xmx4g GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19.fa -input NA19240.raw.SNPs.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.refmt.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_138.b37.refmt.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP -mode SNP -recalFile NA19240.raw.SNPs.recal -tranchesFile NA19240.raw.SNPs.tranches -rscriptFile NA19240.snp.plots.R

However, there is no NA19240.snp.plots.R.pdf generated, and I didn't find any error. When I try to run NA19240.snp.plots.R in R with source('NA19240.snp.plots.R'), there is an error: Error: Use 'theme' instead. (Defunct; last used in version 0.9.1) How can I fix it? Thanks!!

Created 2014-06-30 07:11:50 | Updated | Tags: vqsr exome

Hello, I've asked this question at the workshop in Brussels, and I would like to post it here: I'm working on an exome analysis of a trio, and I would like to run VQSR filtering on the data. Since this is an exome project there are not a lot of variants, and therefore, as I understand it, VQSR is not accurate. You suggest adding more data from 1000 Genomes or other published data. The families that I'm working on belong to a very small and specific population, and I'm afraid that adding published data will add a lot of noise. What do you think: should I add more published data, change parameters such as maxGaussians, or do hard filtering? Thanks, Maya

Created 2014-06-10 18:52:15 | Updated | Tags: vqsr

Hi there, For the SNV model in VariantRecalibrator, I was using QD, MQRankSum, ReadPosRankSum, FS for a little while and then decided to add MQ back in, since I saw that the Best Practices were updated recently and MQ was back in.
However, when I added MQ back in and it went to train the negative model, it said it was training with 0 variants (the same data set without MQ in the model yielded ~30,000 variants to be used in the negative training model). I have attached a text file that has the base command line, followed by the log from the unsuccessful run and then the successful run log. The version is 3.1-1 and there are approximately 700 exomes. Kurt

Created 2014-04-23 13:35:51 | Updated | Tags: vqsr

Hi, I am working on the VQSR step (using GATK 2.8.1) on variants that have been called by UG from ~500 whole genomes of cattle. I run VariantRecalibrator as follows:

${JAVA} ${GATK}/GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R ${REF} -input ${OUTPUT}/GATK-502-sorted.full.vcf.gz \
    -resource:HD,known=false,training=true,truth=true,prior=15.0 HD_bosTau6.vcf \
    -resource:JH_F1,known=false,training=true,truth=false,prior=10.0 F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 BosTau6_dbSNP138_NCBI.vcf \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP -an HaplotypeScore \
    -mode SNP \
    -recalFile ${OUTPUT}/gatk_502_sorted_fixed.recal \
    -tranchesFile ${OUTPUT}/gatk_502_sorted_fixed.tranches \
    -rscriptFile ${OUTPUT}/gatk_502_sorted_fixed.plots.R

HD_bosTau6.vcf : ~770k markers on Illumina bovine high-density chip array

F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf : ~5.4M SNPs

The tranches pdf I got looks really weird, please check the attached file.

Then I tried varying the 'prior' score of the training VCFs, and also supplied an additional VCF file from another project as a training dataset, but I still got a similar tranches graph, e.g.:

-resource:HD,known=false,training=true,truth=true,prior=15.0  HD_bosTau6.vcf
-resource:JH_F1,known=false,training=true,truth=false,prior=12.0  F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf
-resource:DN,known=false,training=true,truth=false,prior=12.0  HC-Plat-FB.3in3.vcf.gz
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0  BosTau6_dbSNP138_NCBI.vcf 

HC-Plat-FB.3in3.vcf.gz : ~ 14M markers

It is worth mentioning that I ran the VariantRecalibrator step with the same parameters and training sets on another 50 whole genomes very recently, and it worked fine. I had also run VariantRecalibrator on the 500 animals before, when I accidentally used an unfiltered VCF called by UG as a training set. Surprisingly, I got a good tranches graph that time, similar to the graph posted in the GATK Best Practices. Do you have any suggestions for me?

Thanks,

Created 2014-04-15 21:37:57 | Updated | Tags: variantrecalibrator vqsr

Hi,

Sorry to bother you guys. Just a few quick questions:

1) I'm attempting to download the bundles for VQSR and I noticed that they are for b37 or hg19. If I performed my initial assemblies and later SNP calls with hg38, will this cause an issue? Should I restart the process using either b37 or hg19?

2) I'm still a bit lost on what is considered "too few variants" for VQSR. As VQSR works best when there are thousands of variants - is this recommendation on a per-sample basis or for an entire project? I'm presently working with sequences from 80 unique samples for a single gene (~100 kbp), and HaplotypeCaller detects on average ~300 raw SNPs. Would you recommend I hard filter instead in my case?

Thanks,

Dave

Created 2014-04-10 17:59:18 | Updated | Tags: vqsr

hi, Geraldine, Thanks for the webinar! You mentioned that VQSR isn't necessary for a single exome, but would there be any drawback to running it on a single exome? I see that it helps to set up the PASS filter.

Created 2014-03-26 19:55:01 | Updated | Tags: vqsr indels

Hi all --

This should be a simple problem -- I cannot find a valid version of the Mills indel reference in the resource bundle, or anywhere else online!

All versions of the reference VCF are stripped of genotypes and do not contain a FORMAT column or any additional annotations.

I am accessing the Broad's public FTP, and none of the Mills VCF files in bundle folders 2.5 or 2.8 contains a full VCF. I understand that there are "sites-only" VCFs, but I can't seem to find anything else.

Can anyone link me to a version that contains the recommended annotations for indel VQSR, or that can be annotated?

Created 2014-02-26 15:33:35 | Updated 2014-02-26 16:17:03 | Tags: vqsr

INFO  17:05:50,124 GenomeAnalysisEngine - Preparing for traversal
INFO  17:05:50,144 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:05:50,144 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:05:50,145 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:05:50,166 TrainingSet - Found hapmap track:    Known = false   Training = true     Truth = true    Prior = Q15.0
INFO  17:05:50,166 TrainingSet - Found omni track:  Known = false   Training = true     Truth = false   Prior = Q12.0
INFO  17:05:50,167 TrainingSet - Found dbsnp track:     Known = true    Training = false    Truth = false   Prior = Q6.0
INFO  17:06:20,149 ProgressMeter -     1:216404576        2.04e+06   30.0 s       14.0 s      7.0%         7.2 m     6.7 m
INFO  17:06:50,151 ProgressMeter -     2:223579089        4.70e+06   60.0 s       12.0 s     15.2%         6.6 m     5.6 m
INFO  17:07:20,159 ProgressMeter -      4:33091662        7.43e+06   90.0 s       12.0 s     23.3%         6.4 m     4.9 m
INFO  17:07:50,161 ProgressMeter -      5:92527959        1.00e+07  120.0 s       11.0 s     31.4%         6.4 m     4.4 m
INFO  17:08:20,162 ProgressMeter -       7:1649969        1.30e+07    2.5 m       11.0 s     39.8%         6.3 m     3.8 m
INFO  17:08:50,168 ProgressMeter -     8:106975025        1.58e+07    3.0 m       11.0 s     48.4%         6.2 m     3.2 m
INFO  17:09:20,169 ProgressMeter -    10:101433561        1.87e+07    3.5 m       11.0 s     57.4%         6.1 m     2.6 m
INFO  17:09:50,170 ProgressMeter -     12:99334147        2.16e+07    4.0 m       11.0 s     66.1%         6.1 m     2.1 m
INFO  17:10:20,171 ProgressMeter -     15:30577012        2.41e+07    4.5 m       11.0 s     75.4%         6.0 m    88.0 s
INFO  17:10:52,409 ProgressMeter -      18:8763648        2.68e+07    5.0 m       11.0 s     83.5%         6.0 m    59.0 s
INFO  17:11:22,410 ProgressMeter -     22:31598896        2.97e+07    5.5 m       11.0 s     92.2%         6.0 m    27.0 s
INFO  17:11:33,135 VariantDataManager - QD:      mean = 17.48    standard deviation = 9.03
INFO  17:11:33,516 VariantDataManager - HaplotypeScore:      mean = 3.03     standard deviation = 2.62
INFO  17:11:33,882 VariantDataManager - MQ:      mean = 52.40    standard deviation = 2.98
INFO  17:11:34,253 VariantDataManager - MQRankSum:   mean = 0.31     standard deviation = 1.02
INFO  17:11:37,973 VariantDataManager - Training with 1024360 variants after standard deviation thresholding.
INFO  17:11:37,977 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO  17:11:53,065 ProgressMeter - GL000202.1:10465        3.08e+07    6.0 m       11.0 s     99.8%         6.0 m     0.0 s
INFO  17:12:09,041 VariantRecalibratorEngine - Finished iteration 0.
INFO  17:12:23,066 ProgressMeter - GL000202.1:10465        3.08e+07    6.5 m       12.0 s     99.8%         6.5 m     0.0 s
INFO  17:12:30,492 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 0.08178
INFO  17:12:51,054 VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.05869
INFO  17:12:53,072 ProgressMeter - GL000202.1:10465        3.08e+07    7.0 m       13.0 s     99.8%         7.0 m     0.0 s
INFO  17:13:11,207 VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.15237
INFO  17:13:23,073 ProgressMeter - GL000202.1:10465        3.08e+07    7.5 m       14.0 s     99.8%         7.5 m     0.0 s
INFO  17:13:31,503 VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.13505
INFO  17:13:51,768 VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.05729
INFO  17:13:53,080 ProgressMeter - GL000202.1:10465        3.08e+07    8.0 m       15.0 s     99.8%         8.0 m     0.0 s
INFO  17:14:11,372 VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.02607
INFO  17:14:23,081 ProgressMeter - GL000202.1:10465        3.08e+07    8.5 m       16.0 s     99.8%         8.5 m     0.0 s
INFO  17:14:24,730 VariantRecalibratorEngine - Convergence after 33 iterations!
INFO  17:14:27,037 VariantRecalibratorEngine - Evaluating full set of 3860460 variants...
INFO  17:14:51,111 VariantDataManager - Found 0 variants overlapping bad sites training tracks.
INFO  17:14:55,071 VariantDataManager - Additionally training with worst 1000 scoring variants --> 1000 variants with LOD <= -30.5662.
INFO  17:14:55,071 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO  17:14:55,082 VariantRecalibratorEngine - Finished iteration 0.
INFO  17:14:55,095 VariantRecalibratorEngine - Convergence after 4 iterations!
INFO  17:14:55,096 VariantRecalibratorEngine - Evaluating full set of 3860460 variants...
INFO  17:15:02,071 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-2-g6bda569):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --numBad 3000, for example).
##### ERROR ------------------------------------------------------------------------------------------

My command is :

java -jar -Xmx4g GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37.fasta -input NA12878_snp.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_132.b37.vcf -an QD -an HaplotypeScore -an MQ -an MQRankSum --maxGaussians 4 -mode SNP -recalFile NA12878_recal.vcf -tranchesFile NA12878_tranches -rscriptFile NA12878.plots.R

Before, I didn't use --maxGaussians 4; once an error suggested it, I tried it but still got this error message... And I think that numBad is already deprecated. I don't understand why this error happens. I'm running GATK UnifiedGenotyper on a 1000 Genomes high coverage BAM file and then using VQSR to filter the SNPs.

Created 2014-02-26 15:13:15 | Updated 2014-02-26 15:35:08 | Tags: vqsr

Hi, I ran VQSR on the VCF file generated by UnifiedGenotyper (run with -glm BOTH, so the file contains both SNPs and indels) and 63412 out of 86840 variants were filtered PASS. I have two questions:

1) The number of PASS SNPs differs when I count them in two ways: first on the original output of UG, and second after separating SNPs and indels into two files using an awk script.

grep -v "#" sample1_recalibrated_snps_PASS.vcf | grep -c "PASS"
63412
grep -v "#" sample1_merged_recalibrated_snps_raw_indels.vcf | grep -c "LowQual"
18725

Statistics for the separate SNP file, where I used an awk script to separate SNPs and indels:

Everything else matches; the only problem is that the PASS SNP counts differ, and I can't work out why.

grep -v "^#" sample1_snp.vcf | grep -c "PASS"
63402
grep -v "^#" sample1_snp.vcf | grep -c "LowQual"
18725
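One possible source of the 63412 vs 63402 discrepancy is that `grep -c "PASS"` counts any line containing the string PASS anywhere (for example in the ID or INFO column), not just records whose FILTER column is PASS. A minimal sketch, using hypothetical demo records rather than the actual files:

```shell
# Two demo records: the second is LowQual but has "PASS" inside its ID field.
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=10\nchr1\t200\trsPASS2\tC\tT\t10\tLowQual\tDP=2\n' > demo.vcf

# Substring match: counts both lines, even though only one passed the filter.
grep -v '^#' demo.vcf | grep -c 'PASS'          # prints 2

# Exact test on the FILTER column (field 7): counts only true PASS records.
awk '!/^#/ && $7 == "PASS"' demo.vcf | wc -l    # prints 1
```

Running the awk version on both of your files would show whether the ten-variant difference comes from incidental "PASS" matches or from the awk separation step itself.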

2) I ran VQSR on the SNPs generated by UnifiedGenotyper, and I have a question about the VQSR tranche plot for SNPs. In my case the tranche plot is not showing any false positive calls (see attached plot). How do I interpret there being no FPs? It seems surprising.

When I tried to run VQSR on the indels (in the same file) it didn't work, as I had only 884 indels, which, from the VQSR documentation and questions asked by other people, I understand is too few.

Created 2014-02-24 22:22:31 | Updated | Tags: vqsr filter gatk

In my Picard/GATK pipeline, I already include the 1000G gold standard and dbSNP files in my VQSR step, and I am wondering if I should further filter the final VCF files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7, and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

These made me wonder if I should use these available VCFs to further filter my VCF files to remove the common SNPs. If so, can I use the "--mask" parameter in VariantFiltration of GATK to do the filtration? Examples below copied from documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T VariantFiltration \
-o output.vcf \
--variant input.vcf \
--filterExpression "AB < 0.2 || MQ0 > 50" \
--filterName "Nov09filters" \
--maskName InDel
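For the masking use case described above, VariantFiltration takes the mask sites through `--mask`, with `--maskName` supplying the label written to the FILTER field. A sketch along the lines of the example, where `common_snps.vcf` is a hypothetical file of sites to mark:

```shell
java -Xmx2g -jar GenomeAnalysisTK.jar \
    -R ref.fasta \
    -T VariantFiltration \
    -o output.masked.vcf \
    --variant input.vcf \
    --mask common_snps.vcf \
    --maskName CommonSNP
```

Note that masking only labels the overlapping sites in the FILTER column; to actually drop them you would still need a subsequent step such as SelectVariants with its exclude-filtered option.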

Created 2014-02-24 22:21:58 | Updated | Tags: vqsr filter gatk

In my PiCard/GATK pipeline, I already include the 1000G_gold_standard and dbsnp files in my VQSR step, I am wondering if I should further filter the final vcf files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7, and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

These made me wonder if I should use these available VCFs to further filter my VCF files to remove the common SNPs. If so, can I use the "--mask" parameter in VariantFiltration of GATK to do the filtration? Examples below copied from documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T VariantFiltration \
-o output.vcf \
--variant input.vcf \
--filterExpression "AB < 0.2 || MQ0 > 50" \
--filterName "Nov09filters" \
--maskName InDel

Created 2014-02-24 16:12:14 | Updated | Tags: variantrecalibrator vqsr applyrecalibration indels

Hi,

Given that there's no tranche plot generated for indels using VariantRecalibrator, how do we assess which tranche to pick for the next step, ApplyRecalibration? In SNP mode, I use the tranche plots to evaluate the tradeoff between true and false positive rates at various tranche levels, but that's not possible with indels.

Thanks!

Grace

Created 2014-02-20 15:30:32 | Updated | Tags: vqsr snpcalling

Hi - I have a question on how best to do VQSR on my samples. One of the read groups for each of my individuals is from genomic DNA and has very even coverage (around 10x), while the remaining 4-5 read groups per individual are from Whole Genome Amplified (WGA) DNA. The WGA read groups have very uneven coverage, ranging from 0 to over 1000 with a mean of around 30x (see attached image; blue is WGA and turquoise is genomic, the y-axis is depth and the x-axis is sliding windows along a chromosome). So I have WGA and genomic libraries for each individual, and their coverage distributions are very different.

We tested different SNP calling (UnifiedGenotyper) and VQSR strategies, and at the moment we think a strategy where we call and run VQSR on the genomic and WGA libraries separately, then combine them at the end, works best. However, I am interested in what the GATK team would have done in such a case. The reason we are doing it separately is that we think VQSR on the combined libraries would not be wise, given the difference in depth (and strand bias) between the WGA and genomic read groups. If there were a way in the VQSR step to incorporate read group differences into the algorithm, it could maybe solve such a problem - but as far as I can see there is no such thing (we used the ReadGroupblacklist option when calling the read groups separately), and for VQSR there is no "include read group effects" kind of option. Or does it intrinsically include read group information in the machine learning step? By the way, we did run BQSR, so the qualities should have been adjusted for read group effects. But there still seems to be a noticeable difference between the VQSR results we get from WGA vs genomic read groups (for instance, WGA read group calls have consistently lower heterozygosity than genomic read group calls, which we think is due to strand bias). From the VQSR plots it is clear that many SNPs are excluded in the WGA read groups due to strand bias and DP - however, the bias is still visible after VQSR.

Sorry for the elaborate explanation - however - my question is how the GATK team would have handled SNPcalling and VQSR if the RG depth vary that much as in the attached image case.

Created 2014-01-21 13:32:12 | Updated | Tags: vqsr selectvariants vcf

I just wanted to select variants from a VCF with 42 samples. After 3 hours I got the following error. How can I fix this? Please advise - I had the same problem when I used VQSR. Thanks.

INFO 20:28:17,247 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,250 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
INFO 20:28:17,250 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 20:28:17,251 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 20:28:17,255 HelpFormatter - Program Args: -T SelectVariants -rf BadCigar -R /groups/body/JDM_RNA_Seq-2012/GATK/bundle-2.3/ucsc.hg19/ucsc.hg19.fasta -V /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf -L chr1 -L chr2 -L chr3 -selectType SNP -o /hms/scratch1/mahyar/Danny/data/Filter/extract_SNP_only3chr.vcf
INFO 20:28:17,256 HelpFormatter - Date/Time: 2014/01/20 20:28:17
INFO 20:28:17,256 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,256 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,305 ArgumentTypeDescriptor - Dynamically determined type of /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf to be VCF
INFO 20:28:18,053 GenomeAnalysisEngine - Strictness is SILENT
INFO 20:28:18,167 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 20:28:18,188 RMDTrackBuilder - Creating Tribble index in memory for file /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf
INFO 23:15:08,278 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.NegativeArraySizeException
    at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:97)
    at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:116)
    at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:84)
    at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:73)
    at net.sf.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
    at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:46)
    at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:24)
    at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73)
    at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35)
    at org.broad.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
    at org.broad.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:428)
    at org.broad.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:390)
    at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:288)
    at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:278)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140)
    at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.<init>(ReferenceOrderedDataSource.java:208)
    at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.<init>(ReferenceOrderedDataSource.java:88)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:964)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:758)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

##### ERROR ------------------------------------------------------------------------------------------

Created 2014-01-13 11:55:08 | Updated | Tags: vqsr haplotypecaller exome variant-calling

We are running GATK HaplotypeCaller on ~50 whole-exome samples. We are interested in rare variants, so we ran GATK in single-sample mode instead of multi-sample mode as you recommend; however, we would like to take advantage of VQSR. What would you recommend? Can we run VQSR on the output from GATK single-sample calling?

Additionally, we are likely to run extra batches of new exome samples. Should we wait until we have them all before running them through the GATK pipeline?

Many thanks in advance.

Created 2013-12-31 02:11:46 | Updated 2013-12-31 02:12:36 | Tags: vqsr best-practices non-human

Hello there! Thanks as always for the lovely tools, I continue to live in them.

• I've been wondering how best to interpret my VQSLOD plots/tranches and the resulting VQSLOD scores. Attached are those plots, along with a histogram of my VQSLOD scores as they occur across my replicate samples.

Methods Thus Far

We have HiSeq reads of "mutant" and wild-type (wt) fish, three replicates of each. The sequences were captured by a size-selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (Ti/Tv-independent).

As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.

My semi-educated guess:

The spike in VQSLOD I see for variants found across all six replicates is simply the rediscovery of those in my truth set, plus those with amazing coverage, which is probably fine/good. The part that worries me is the plots and tranches. The plots never really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK, or do that and my inflated VQSLOD values ring of poor practice?

Created 2013-11-14 17:19:47 | Updated | Tags: variantrecalibrator vqsr

I'm somewhat struggling with the new negative training model in 2.7. Specifically, this paragraph in the FAQ causes me trouble:

Finally, please be advised that while the default recommendation for --numBadVariants is 1000, this value is geared for smaller datasets. This is the number of the worst scoring variants to use when building the model of bad variants. If you have a dataset that's on the large side, you may need to increase this value considerably, especially for SNPs.

And so I keep thinking about how to scale it with my dataset, and I keep wanting to just make it a percentage of the total variants - which is of course the behavior that was removed! In the Version History for 2.7, you say

Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number

Can you comment a little further about how it scales? I'm assuming it's non-linear, and my intuition would be that smaller sets have proportionally more bad variants. Is that what you've seen? Do you have any other observations that could help guide selection of that parameter?

Created 2013-11-01 20:03:17 | Updated 2013-11-01 20:04:57 | Tags: vqsr gatk

I have the following entries in my VCF file output from VQSR. What does the "VQSRTrancheINDEL99.00to99.90" string mean? Did these variants fail the recalibration?

PASS
VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.00to99.90
PASS
VQSRTrancheINDEL99.00to99.90
PASS
PASS
VQSRTrancheINDEL99.90to100.00
VQSRTrancheINDEL99.90to100.00
VQSRTrancheINDEL99.90to100.00
PASS
VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.00to99.90

Below is the command I used:

java -Xmx6g -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R GATK_ref/hg19.fasta \
-nt 5 \
--input ../GATK/VQSR/parallel_batch/combined_raw.snps_indels.vcf \
-mode INDEL \
--ts_filter_level 99.0 \
-recalFile ../GATK/VQSR/parallel_batch/Indels/exome.indels.vcf.recal \
-tranchesFile ../GATK/VQSR/parallel_batch/Indels/exome.indels.tranches \
-o ../GATK/VQSR/parallel_batch/Indels/exome.indels.filtered.vcf
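The tranche labels above can be read as sensitivity intervals rather than a simple pass/fail: PASS records fell within the requested 99.0 truth-sensitivity level, while VQSRTrancheINDEL99.00to99.90 records were filtered at 99.0 but would be retained at a 99.9 filter level. A minimal sketch of that interpretation (my own illustration, not GATK code; the function name is hypothetical):

```python
# Hypothetical helper: decide whether a VQSR-filtered VCF record survives
# at a chosen truth-sensitivity level. PASS records always survive; tranche
# labels like "VQSRTrancheINDEL99.00to99.90" survive only if the chosen
# level reaches the tranche's upper bound.
import re

def passes_at(filter_field: str, ts_level: float) -> bool:
    if filter_field == "PASS":
        return True
    m = re.match(r"VQSRTranche(?:SNP|INDEL)([0-9.]+)to([0-9.]+)", filter_field)
    if m is None:
        return False  # unrelated filter (e.g. LowQual)
    return ts_level >= float(m.group(2))

# At the 99.0 level used in the command above, tranche records are filtered:
print(passes_at("VQSRTrancheINDEL99.00to99.90", 99.0))  # False
# Relaxing to 99.9 would admit them:
print(passes_at("VQSRTrancheINDEL99.00to99.90", 99.9))  # True
```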

Created 2013-09-10 10:47:23 | Updated | Tags: vqsr

Hi, thanks very much for your answers to my previous questions. It seems that I encountered another difficulty when I ran the VQSR steps, because some ERROR information appeared on the screen. The error info is as follows:

INFO 18:10:01,046 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO 18:10:01,165 VariantRecalibratorEngine - Finished iteration 0.
INFO 18:10:01,186 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.15059
INFO 18:10:01,196 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.06115
INFO 18:10:01,206 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.34881
INFO 18:10:01,208 VariantRecalibratorEngine - Convergence after 16 iterations!
INFO 18:10:01,211 VariantDataManager - Found 0 variants overlapping bad sites training tracks.
INFO 18:10:27,971 ProgressMeter - chr1:249230318 4.34e+06 90.0 s 20.0 s 100.0% 90.0 s 0.0 s

##### ERROR ------------------------------------------------------------------------------------------

I think the parameters I set are all right:

java -jar /ifs1/ST_POP/USER/lantianming/HUM/bin/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar \
-R /ifs1/ST_POP/USER/lantianming/HUM/reference_human/chr1.fa \
--maxGaussians 4 \
-numBad 4000 \
-T VariantRecalibrator \
-mode SNP \
-input /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.recal_10.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/dbsnp_137.hg19.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/hapmap_3.3.hg19.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/1000G_omni2.5.hg19.vcf \
-an DP -an FS -an HaplotypeScore -an MQ0 -an MQ -an QD \
-recalFile /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.vcf.snp_11.recal \
-tranchesFile /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.vcf.snp_11.tranches \
-rscriptFile /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.vcf.snp_11.plot.R \
-nt 4 \
--TStranche 90.0 --TStranche 93.0 --TStranche 95.0 --TStranche 97.0

My input file is chr1, the sequencing depth is about 1×, and 4000 SNP sites were called by UnifiedGenotyper. What I am not sure about is whether that number of SNP sites is enough for VQSR. Could you please give me some suggestions? Thanks very much!

Created 2013-08-11 16:50:39 | Updated | Tags: vqsr

When using GENOTYPE_GIVEN_ALLELES with HaplotypeCaller, which uses EMIT_ALL_SITES and so produces many calls where the entire cohort is non-variant, do these reference-only sites have to be filtered out before running VQSR?

Created 2013-07-31 13:21:04 | Updated | Tags: vqsr

Hi,

I am working on the dog genome and trying to use VQSR on my data.

Here is the command I have used:

java -Xmx4G -jar GenomeAnalysisTK.jar \
-R genome.fa \
-T VariantRecalibrator \
-input GATK-snp.vcf \
-resource:dbsnp,known=false,training=true,truth=true,prior=6.0 canFam3_SNP.vcf \
-mode SNP \
-recalFile output.recal \
-tranchesFile output.tranches \
-rscriptFile output.plots.R \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an Inbreed

1. I have only the dbSNP file as a training set, and I set the options known=true,training=false,truth=false,prior=6.0 on the command line as per the documentation. But that didn't work, and it was suggested instead to use known=false,training=true,truth=true,prior=6.0. What is prior=6.0 here? Is there any threshold for the prior?

2. The above command produces empty tranches and recal files.

3. Even though the files are empty, I proceeded to ApplyRecalibration with the command below:

java -Xmx4G -jar GenomeAnalysisTK.jar \
-R genome.fa \
-T ApplyRecalibration \
-input GATK-snp.vcf \
--ts_filter_level 99.0 \
-tranchesFile output.tranches \
-recalFile output.recal \
-mode SNP \
-o recalibrated.filtered.vcf

It gives the error:

ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:

##### ERROR

Any help to fix these?

Created 2013-07-15 12:45:29 | Updated | Tags: vqsr

Hi team, thanks for a great job developing this software!

I am planning to use the GATK in a class as a demo of how to do SNP detection and VQSR in a non-model organism, but due to time constraints I have a very small dataset (12 samples of 100K reads each).

I am using a SNP quality threshold of Q>20 for an initial round of SNP detection, which I then use as a "true" training set for the VQSR, and a call set with Q>3 as my variants of interest.

I keep getting the error message "NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)"

which is not surprising, even though I have already set --maxGaussians 2 -percentBad 0.01 -minNumBad 50

To reiterate, this is for educational purposes. I am wondering if I can move past this error message and get an output file despite this error?

Thanks!

/Pierre De Wit

Created 2013-07-03 16:16:32 | Updated | Tags: vqsr

How should I use the VQSR -tranche argument?

From the tutorial I gather that I should specify the list of doubles like this: -tranche [100.0, 99.9, 99.0, 90.0] (see http://www.broadinstitute.org/gatk/guide/topic?name=tutorials#id2805).

But when I try that, like this:

java -jar GenomeAnalysisTK-2.6-3-gdee51c4/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R ref.fa \
-input input.vcf \
-resource:snparray,known=true,training=true,truth=true,prior=15.0 input_concordantW_SNPArray.vcf \
-an QD -an ReadPosRankSum -an MQRankSum -an MQ -an FS -an DP -an ClippingRankSum -an BaseQRankSum -an AF \
-titv 2.5 \
--mode SNP \
-recalFile input.recal \
-tranchesFile input.tranches \
-rscriptFile input.plots.R \
-tranche [100.0, 99.9, 99.0, 90.0]

I get

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR ------------------------------------------------------------------------------------------
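For context, a bracketed list like [100.0, 99.9, 99.0, 90.0] is generally mangled by the shell; GATK list arguments are typically passed as one flag per value, as in the repeated --TStranche usage quoted in another post on this page. A small sketch of building such an argument vector (illustration only, not an official recommendation):

```python
# Sketch: build a GATK argument list with one -tranche flag per value,
# mirroring the repeated --TStranche style seen elsewhere on this page.
tranches = [100.0, 99.9, 99.0, 90.0]
args = []
for t in tranches:
    args += ["-tranche", str(t)]

print(args)
# ['-tranche', '100.0', '-tranche', '99.9', '-tranche', '99.0', '-tranche', '90.0']
```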

Created 2013-06-29 20:51:03 | Updated | Tags: vqsr

Hi,

Maybe I have not been able to find some obvious piece of documentation, but I am searching for best practices for using VQSR with sex chromosomes (especially X). I am trying to do variant calling on Anopheles gambiae genomes (which have sex chromosomes like humans), and the results for chromosome X are not very encouraging. I was wondering if there is any documentation/best practice for VQSR, especially for X. Or even whether people are using VQSR with sex chromosomes at all?

Clueless and lost, Tiago

Created 2013-06-08 06:32:45 | Updated | Tags: vqsr exome

Hi, I'm working with trios and small pedigrees (up to six individuals). The VQSR section of the 'best practice' document states that 'in order to achieve the best exome results one needs an exome callset with at least 30 samples', and suggests adding additional samples such as 1000 Genomes BAMs.
I'm a little confused about two aspects:
1) The addition of 1000G BAMs being suggested in the VQSR section. If we need the 1000G call sets, would we have to run these through the HaplotypeCaller or UnifiedGenotyper stages? Please forgive the question; I'm not trying to find fault in your perfect document, but please confirm, as it would dramatically increase compute time (though only once). This also overlaps with my next point of confusion:
2) I can understand how increasing the number of individuals from a consistent cohort, or maybe even from very similar experimental platforms, improves the outcome of the VQSR stage. However, the workshop video notes that variant call properties are highly dependent on the individual experiment (design, coverage, technical factors, etc.). So I can't understand how the overall result is improved when I add variant calls from 30 1000G exomes (with their own typical variant quality distributions) to my trio's variant calls (which have their own quality distribution, very different from the 1000G one).

Hopefully I'm missing an important point somewhere? Many thanks in advance, K

Created 2013-06-03 17:08:53 | Updated | Tags: variantrecalibrator vqsr plot gaussian mixture model

Hi again!

Could you please help me generate the first plot in the attached file, which refers to VariantRecalibrator?

In other words, is this plot generated at the same time as my_sample.bqrecal.vqsr.R.scripts.pdf? If so, maybe some R library is missing, but I can't find anything wrong in the log files (my_sample.bqrecal.vqsr.R.scripts.pdf seems fine and healthy to me).

Thanks in advance, Rodrigo.

Created 2013-05-13 14:10:51 | Updated | Tags: vqsr indels vqslod

Hi Mark, Eric -

First, I wanted to thank you guys for providing advice on running VQSR. I am already sold and a huge fan of the method :-).

I was wondering if either of you could comment on VQSLOD and the sensitivity filter tranche? To be more specific, if I set a filter threshold of 99% sensitivity and get VQSLOD < 0, I imagine that is probably not a good idea! However, a VQSLOD of 3 or 5 may be appropriate in the statistical sense, i.e. pretty confident that this is a real variant. Finally, I am thinking we should include VQSLOD in our statistical genetic association mapping methods. I wanted to get a sense from either of you of what VQSLOD range you would want to completely remove from analysis?

Best Wishes,

Manny.
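For intuition when thinking about such cutoffs: VQSLOD is a log10-odds score, so it maps directly to an approximate probability (under the model) that a call is a true variant. A minimal sketch of that conversion (my own illustration, not GATK code):

```python
def vqslod_to_prob(vqslod: float) -> float:
    """Convert a VQSLOD (log10 odds of being a true variant) to a probability."""
    odds = 10.0 ** vqslod
    return odds / (1.0 + odds)

# VQSLOD 0 corresponds to even odds:
print(round(vqslod_to_prob(0.0), 3))   # 0.5
# VQSLOD 3 corresponds to roughly 99.9% confidence:
print(round(vqslod_to_prob(3.0), 4))   # 0.999
```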

Created 2013-05-06 15:19:46 | Updated | Tags: vqsr variantannotator

Hi,

I just ran HaplotypeCaller on a dataset. For the same dataset, I previously ran UnifiedGenotyper and subjected the raw VCF from UG directly to the VQSR step, without the help of VariantAnnotator, and got through VQSR without any problem. However, when I try to subject the raw callset derived from HaplotypeCaller directly to the VQSR step, the VQSR module complains, with the error message below:

...

#### ERROR MESSAGE: Bad input: Values for HaplotypeScore annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See

So after HaplotypeCaller, the derived VCF file needs to be run through VariantAnnotator? Since the UnifiedGenotyper-derived callset does not need the help of VariantAnnotator (all annotations needed for VQSR are included after UG), it seems that is not the case for HaplotypeCaller? I can run the HaplotypeCaller-derived VCF file through VariantAnnotator; I just want to make sure my understanding is correct.

Thanks and best

Mike

Created 2013-03-25 22:10:53 | Updated | Tags: vqsr mouse

I was wondering if anyone has used VQSR for a mouse genome project. I am working with the mm10 dbSNP and short-insert DNA-seq data for multiple homozygous mouse samples. I have obtained decent results so far using the mm10 dbSNP as the training set, but was curious whether anyone has recommendations on what settings to use. Any input is appreciated. I also have a lot of RNA-seq data, but that will come at a much later point in time. Thanks!

Created 2013-03-06 05:20:13 | Updated | Tags: vqsr multi-sample

Hi,

I've been going through the VQSR documentation/guide and haven't been able to pin down an answer to how it behaves on a multi-sample VCF (generated by multi-sample calling with UG). Should VQSR be run on this? Or on each sample separately, given that coverage and the other statistics used to determine the variant confidence score aren't the same for each sample, and so can lead to conflicting determinations for different samples?

Many thanks.

Created 2013-02-02 15:23:08 | Updated 2013-02-02 22:49:30 | Tags: unifiedgenotyper vqsr haplotypecaller validation statistics sensitivity specificity

Hi all, I've read somewhere on this site that before VQSR the FP rate is expected to be around 10% (I guess for UnifiedGenotyper). Are there some updated statistics for VQSR? For HaplotypeCaller? For exome/WG data? Another thing: we apply VQSR in all our analyses, and we are trying to collect some validation statistics. We suspect that most of the FPs have some particular "culprits" in VQSR (especially QD and MQ). Do you have some data about this? Best

d

Created 2012-12-31 03:42:30 | Updated | Tags: vqsr errorthrowing

I am seeing this error on a single human WGS sample:

The provided VCF file is malformed at approximately line number "x": there are 557 genotypes while the header requires that 1525 genotypes be present for all records

Interestingly, when I run VQSR as part of the same pipeline on the same sample consecutive times, the "x" changes to a different line number each time. I was wondering if someone could explain the meaning of the error message in more detail?

Created 2012-12-14 14:47:59 | Updated | Tags: vqsr

Hi,

Recently I ran into an odd observation in VQSR. I have 17 samples from the same family; I used all 17 samples to call SNPs, and after VQSR I got a tranches file like this:

# Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000
99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900
99.90,67220,2884,2.7190,1.8222,-10.0009,VQSRTrancheSNP99.00to99.90,SNP,26182,26155,0.9990
100.00,69714,4998,2.6822,1.8300,-1122.0698,VQSRTrancheSNP99.90to100.00,SNP,26182,26182,1.0000

which seems fine. Then, for research purposes, I used only 5 of the samples with a tighter relationship (two parents and their 3 immediate children), and after VQSR the tranches file looks like this:

# Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP0.00to90.00,SNP,20850,20850,1.0000
99.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP90.00to99.00,SNP,20850,20850,1.0000
99.90,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.00to99.90,SNP,20850,20850,1.0000
100.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000

Notice that the 5-sample VQSR tranches file has exactly the same values throughout all thresholds: 90, 99, 99.90 and 100. The VQSR modeling plot is also very odd: no plotting at all can be seen (the PDF file was created but was almost blank, in contrast to the normal projection plots I saw in other cases).

However, we did use the old version to call the same 5 samples before, and that tranches file looks like this:

# Version number 4

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,36407,361,2.8657,2.3119,5.0854,TruthSensitivityTranche0.00to90.00,20814,18732,0.9000
99.00,44097,638,2.7655,2.2222,2.2592,TruthSensitivityTranche90.00to99.00,20814,20605,0.9900
99.90,47947,1061,2.7078,1.8750,-7.4143,TruthSensitivityTranche99.00to99.90,20814,20793,0.9990
100.00,50426,2318,2.6645,1.7677,-647.3944,TruthSensitivityTranche99.90to100.00,20814,20814,1.0000

This time it looks reasonable to me. This is troubling us, since for 5 samples the old version (v1.6-7) seems to work fine, whereas the new version (v2.1-13) seems to have an issue, or cannot achieve any further filtering by VQSR (90, 99 and 100 give the same result; I repeated this multiple times and got the same results), although for all 17 samples the new version seems fine with VQSR.

So my questions are:

1. Is it possible that on some occasions VQSR simply does not work?
2. Why does the old version seem to work but not the new version, for exactly the same set of 5-sample data?

Thanks a lot for your help!

Mike
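Tranches files like the ones quoted above are plain CSV with "#" comment lines, so they are easy to inspect programmatically, e.g. to recover the minVQSLod cutoff per target sensitivity. A minimal sketch (hypothetical helper, not part of GATK):

```python
# Hypothetical parser for a GATK tranches file: CSV rows preceded by
# "#" comment lines. Returns {targetTruthSensitivity: minVQSLod}.
import csv
import io

def min_vqslod_by_tranche(text: str) -> dict:
    rows = [line for line in text.splitlines() if line and not line.startswith("#")]
    out = {}
    for rec in csv.DictReader(io.StringIO("\n".join(rows))):
        out[float(rec["targetTruthSensitivity"])] = float(rec["minVQSLod"])
    return out

# Using the 17-sample tranches quoted above (abbreviated to two rows):
example = """# Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000
99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900
"""
print(min_vqslod_by_tranche(example))  # {90.0: 4.839, 99.0: 1.7766}
```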

Created 2012-11-27 16:43:57 | Updated 2012-11-27 16:50:38 | Tags: unifiedgenotyper vqsr

Hi,

I observed a significant difference between the variant call sets from the same exomes under v1.6 and v2.2(-10). In particular, I observed a significant decrease in the overall novel Ti/Tv in the latter call sets, from around 2.6 to 2.1, at a TruthSensitivity threshold of 99.0. When I compared variant sites for one sample using VariantEval, it showed:

Filter JexlExpression Novelty nTi nTv tiTvRatio
called Intersection known 14624 4563 3.2
called Intersection novel 856 312 2.74
called filterIngatk22-gatk16 known 264 132 2
called filterIngatk22-gatk16 novel 28 18 1.56
called gatk16 known 3 1 3
called gatk16 novel 1 1 1
called gatk22-filterIngatk16 known 258 94 2.74
called gatk22-filterIngatk16 novel 144 425 0.34
called gatk22 known 2 2 1
called gatk22 novel 17 30 0.57
filtered FilteredInAll known 1344 649 2.07
filtered FilteredInAll novel 1076 1642 0.66

The novel Ti/Tv of calls new in v2.2 (either not found in v1.6, or called in v2.2 but filtered in v1.6) is around 0.5. So I suspect that the VQSLOD scoring (or ranking) of SNPs changed substantially, in a somewhat unfavorable way.

The major updates in v2.2 affecting my result were BQSRv2, ReduceReads, UG and VariantAnnotation (too many things to pinpoint the culprit...). The previous BAM processing and variant calls were made using v1.6. For the new call set, I used v2.1-9 (so after the serious bug fix in ReduceReads; thank you for the fix) for BQSRv2 and ReduceReads, and v2.2-10 for UG and VQSR.

As a first clue, I found that the distribution of FS values changed dramatically from v1.6 (please see the attached plots). Although I recognize that the FS value calculation was recently updated, the distribution of the previous FS values (please see attached) makes more sense to me, because the current FS values do not seem to provide information that would separate true positives from false positives.

Thanks in advance. Katsuhito

Created 2012-11-13 10:13:24 | Updated | Tags: vqsr gatk error

Hi all, I'm running VariantRecalibrator on a SNP set (47 exomes) and I get this error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.2-3-gde33222):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
##### ERROR ------------------------------------------------------------------------------------------

this is the command line:

    java -Djava.io.tmpdir=/lustre2/scratch/  -Xmx32g -jar /lustre1/tools/bin/GenomeAnalysisTK-2.2-3.jar \
-T VariantRecalibrator \
-R /lustre1/genomes/hg19/fa/hg19.fa \
-input /lustre1/workspace/Ferrari/Carrera/Analysis/UG/bpd_ug.SNP.vcf \
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 /lustre1/genomes/hg19/annotation/hapmap_3.3.hg19.sites.vcf.gz \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 /lustre1/genomes/hg19/annotation/1000G_omni2.5.hg19.sites.vcf.gz \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 /lustre1/genomes/hg19/annotation/dbSNP-137.chr.vcf -an QD \
-an HaplotypeScore \
-an MQRankSum \
-an FS \
-an MQ \
-an DP \
-an QD \
-an InbreedingCoeff \
-mode SNP \
-recalFile /lustre2/scratch/Carrera/Analysis2/snp.ug.recal.csv \
-tranchesFile /lustre2/scratch/Carrera/Analysis2/snp.ug.tranches \
-rscriptFile /lustre2/scratch/Carrera/Analysis2/snp.ug.plot.R \
-U ALLOW_SEQ_DICT_INCOMPATIBILITY \
--maxGaussians 6

I've already tried decreasing the --maxGaussians option to 4, and I've also added the --percentBad option (setting it to 0.12, as for INDEL), but I still get the error. I added the -debug option to see what's happening, but apparently it has been removed from GATK 2.2. Any help is appreciated. Thanks!

Created 2012-11-12 19:52:41 | Updated 2012-11-12 19:53:09 | Tags: vqsr

Hi,

I'm having a little trouble understanding the relationship between the -ts_filter_level and -tranche settings for VQSR. If I'm not mistaken, the defaults are 99 and [100, 99.9, 99.0, 90] respectively. When I run VQSR with these defaults, my tranches are altered because of the 99 ts filter level. I get:

##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=TruthSensitivityTranche99.00to99.90,Description="Truth sensitivity tranche level at VSQ Lod: -0.1838 <= x < 3.1102">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00+,Description="Truth sensitivity tranche level at VQS Lod < -6135.0237">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00,Description="Truth sensitivity tranche level at VSQ Lod: -6135.0237 <= x < -0.1838">

Is it odd that there are two tranches with the same ts values and different VQSLOD values? If I adjust the ts filter level to 90, I get what I originally expected to see:

##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=TruthSensitivityTranche90.00to99.00,Description="Truth sensitivity tranche level at VSQ Lod: 2.5901 <= x < 4.8133">
##FILTER=<ID=TruthSensitivityTranche99.00to99.90,Description="Truth sensitivity tranche level at VSQ Lod: -0.692 <= x < 2.5901">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00+,Description="Truth sensitivity tranche level at VQS Lod < -6.11002079587E7">

Is it just me, or does this seem to be an incompatibility between the default values? Which is more important, correct ts filtering or correct tranche intervals? We will at times filter based on these tranches, so I'd like to be setting them correctly. Thanks.

Ben

Created 2012-10-23 02:15:29 | Updated 2013-01-07 20:11:44 | Tags: unifiedgenotyper vqsr tranches multi-sample

Hello,

I am trying to run GATK on a sample of 119 exomes. I followed the GATK guidelines to process the FASTQ files. I used the following parameters for UnifiedGenotyper and VQSR [for SNPs]:

UnifiedGenotyper

-T UnifiedGenotyper
--output_mode EMIT_VARIANTS_ONLY
--min_base_quality_score 30
--max_alternate_alleles 5
-glm SNP 

VQSR

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /media/transcription/cipn/5.pt/ref/hapmap_3.3.hg19.sites.vcf
-resource:omni,known=false,training=true,truth=false,prior=12.0 /media/transcription/cipn/5.pt/ref/1000G_omni2.5.hg19.sites.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /media/transcription/cipn/5.pt/ref/dbsnp_135.hg19.vcf.gz
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff
-mode SNP 

I get a tranche plot which does not look OK. The "Number of Novel Variants [1000s]" goes from -400 to 800, and the Ti/Tv ratio varies from 0.633 to 0.782. [The attach-file link is not working for me, so I am unable to upload the plot.] Any suggestion to rectify this would be very helpful!

cheers, Rahul

Created 2012-10-11 18:12:51 | Updated 2013-01-07 19:13:46 | Tags: variantrecalibrator vqsr tranches

Hello,

I am running Variant Quality Score Recalibration on indels with the following command.

java -Xmx8g -jar /raid/software/src/GenomeAnalysisTK-1.6-9-g47df7bb/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R /raid/references-and-indexes/hg19/bwa/hg19_lite.fa \
-input indel_output_all_chroms_combined.vcf \
--maxGaussians 4 -std 10.0 -percentBad 0.12 \
-resource:mills,known=true,training=true,truth=true,prior=12.0  /raid/Merlot/exome_pipeline_v1/ref/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-an QD -an FS -an HaplotypeScore -an ReadPosRankSum  \
--ts_filter_level 95.0 \
-mode INDEL \
-recalFile /raid2/projects/STFD/indel_output_7.recal \
-tranchesFile /raid2/projects/STFD/indel_output_7.tranches \
-rscriptFile /raid2/projects/STFD/indel_output_7.plots.R

My tranches file reports only false positives for all tranches. When I run VQSR on SNPs, the tranches have many true positives and look similar to other tranches files posted on this site. I am wondering if anyone has had similar experiences or has suggestions?

Thanks