# Tagged with #varianteval 4 documentation articles | 1 announcement | 24 forum discussions

Created 2015-10-21 15:23:10 | Updated 2015-12-01 23:12:30 | Tags: varianteval genotypeconcordance collectvariantcallingmetrics indel-ratio titv truthset

## Introduction

Running through the steps involved in variant discovery (calling variants, joint genotyping and applying filters) produces a variant callset in the form of a VCF file. So what’s next? Technically, that callset is ready to be used in downstream analysis. But before you do that, we recommend running some quality control analyses to evaluate how “good” that callset is.

To be frank, distinguishing between a “good” callset and a “bad” callset is a complex problem. If you knew the absolute truth of what variants are present or not in your samples, you probably wouldn’t be here running variant discovery on some high-throughput sequencing data. Your fresh new callset is your attempt to discover that truth. So how do you know how close you got?

### Methods for variant evaluation

There are several methods that you can apply which offer different insights into the probable biological truth, all with their own pros and cons. Possibly the most trusted method is Sanger sequencing of regions surrounding putative variants. However, it is also the least scalable as it would be prohibitively costly and time-consuming to apply to an entire callset. Typically, Sanger sequencing is only applied to validate candidate variants that are judged highly likely. Another popular method is to evaluate concordance against results obtained from a genotyping chip run on the same samples. This is much more scalable, and conveniently also doubles as a quality control method to detect sample swaps. Although it only covers the subset of known variants that the chip was designed for, this method can give you a pretty good indication of both sensitivity (ability to detect true variants) and specificity (not calling variants where there are none). This is something we do systematically for all samples in the Broad’s production pipelines.

The third method, presented here, is to evaluate how your variant callset stacks up against another variant callset (typically derived from other samples) that is considered to be a truth set (sometimes referred to as a gold standard -- these terms are very close and often used interchangeably). The general idea is that key properties of your callset (metrics discussed later in the text) should roughly match those of the truth set. This method is not meant to render any judgments about the veracity of individual variant calls; instead, it aims to estimate the overall quality of your callset and detect any red flags that might be indicative of error.

### Underlying assumptions and truthiness*: a note of caution

It should be immediately obvious that there are two important assumptions being made here: 1) that the content of the truth set has been validated somehow and is considered especially trustworthy; and 2) that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set. These assumptions are not always well-supported, depending on the truth set, your callset, and what they have (or don’t have) in common. You should always keep this in mind when choosing a truth set for your evaluation; it’s a jungle out there. Consider that if anyone can submit variants to a truth set’s database without a well-regulated validation process, and there is no process for removing variants if someone later finds they were wrong (I’m looking at you, dbSNP), you should be extra cautious in interpreting results. *With apologies to Stephen Colbert.

### Validation

So what constitutes validation? Well, the best validation is done with orthogonal methods, meaning that it is done with technology (wetware, hardware, software, etc.) that is not subject to the same error modes as the sequencing process. Calling variants with two callers that use similar algorithms? Great way to reinforce your biases. It won’t mean anything that both give the same results; they could both be making the same mistakes. On the wetlab side, Sanger and genotyping chips are great validation tools; the technology is pretty different, so they tend to make different mistakes. Therefore it means more if they agree or disagree with calls made from high-throughput sequencing.

### Matching populations

Regarding the population genomics aspect: it’s complicated -- especially if we’re talking about humans (I am). There’s a lot of interesting literature on this topic; for now let’s just summarize by saying that some important variant calling metrics vary depending on ethnicity. So if you are studying a population with a very specific ethnic composition, you should try to find a truth set composed of individuals with a similar ethnic background, and adjust your expectations accordingly for some metrics.

Similar principles apply to non-human genomic data, with important variations depending on whether you’re looking at wild or domesticated populations, natural or experimentally manipulated lineages, and so on. Unfortunately we can’t currently provide any detailed guidance on this topic, but hopefully this explanation of the logic and considerations involved will help you formulate a variant evaluation strategy that is appropriate for your organism of interest.

## Variant evaluation metrics

So let’s say you’ve got your fresh new callset and you’ve found an appropriate truth set. You’re ready to look at some metrics (but don’t worry yet about how; we’ll get to that soon enough). There are several metrics that we recommend examining in order to evaluate your data. The set described here should be considered a minimum and is by no means exclusive. It is nearly always better to evaluate more metrics if you possess the appropriate data to do so -- and as long as you understand why those additional metrics are meaningful. Please don’t try to use metrics that you don’t understand properly, because misunderstandings lead to confusion; confusion leads to worry; and worry leads to too many desperate posts on the GATK forum.

### Variant-level concordance and genotype concordance

The relationship between variant-level concordance and genotype concordance is illustrated in this figure.

• Variant-level concordance operates at the site level. It evaluates what proportion of the variant sites present in the truth set were also called in your callset, without considering whether the genotypes assigned at those sites match.

• Genotype concordance is a similar metric but operates at the genotype level. It allows you to evaluate, within the set of variant sites that are present in both your sample callset and your truth set, what proportion of the genotype calls have been assigned correctly. This assumes that you are comparing your sample to a matched truth set derived from the same original sample.
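As a toy illustration of the distinction (not GATK code; the callsets and genotypes below are invented), site-level concordance compares positions, while genotype concordance additionally compares the calls made at shared positions:

```python
# Each callset maps a (chrom, pos) site to a called genotype, e.g. "0/1".
eval_calls  = {("1", 10327): "0/1", ("1", 20117): "1/1", ("1", 30366): "0/1"}
truth_calls = {("1", 10327): "0/1", ("1", 20117): "0/1", ("2", 41234): "1/1"}

shared_sites = eval_calls.keys() & truth_calls.keys()

# Variant-level concordance: fraction of eval sites also present in the truth set.
site_concordance = len(shared_sites) / len(eval_calls)

# Genotype concordance: among shared sites, fraction with the same genotype call.
genotype_matches = sum(eval_calls[s] == truth_calls[s] for s in shared_sites)
genotype_concordance = genotype_matches / len(shared_sites)

print(site_concordance)      # 2 of 3 eval sites are in the truth set
print(genotype_concordance)  # 1 of 2 shared sites has the same genotype
```

A real comparison would of course be done per allele on full VCFs; this sketch only shows why the two numbers answer different questions.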

### Number of Indels & SNPs and TiTv Ratio

These metrics are widely applicable. The table below summarizes their expected value ranges for human germline data:

| Sequencing Type | # of Variants* | TiTv Ratio |
| --- | --- | --- |
| WGS | ~4.4M | 2.0-2.1 |
| WES | ~41k | 3.0-3.3 |

*for a single sample

• Number of Indels & SNPs The numbers of variants detected in your sample(s) are counted separately as indels (insertions and deletions) and SNPs (single nucleotide polymorphisms). Many factors can affect this statistic, including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvements due to a software update. For reference, see the 2015 paper published in Nature, in which the number of variants was analyzed across various ethnicities in a moderately large cohort. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something has gone extremely wrong: e.g. 1,000 variants in a large-cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.

• TiTv Ratio This metric is the ratio of transition (Ti) to transversion (Tv) SNPs. If the distribution of transition and transversion mutations were random (i.e. without any biological influence) we would expect a ratio of 0.5, simply because there are twice as many possible transversion mutations as there are transitions. However, in the biological context, it is very common to see a methylated cytosine undergo deamination to become thymine. As this is a transition mutation, it raises the expected ratio from the random 0.5 to ~2.0-2.1. Furthermore, CpG islands, usually found in promoter regions, have higher concentrations of methylcytosines. By including these regions, whole exome sequencing shows an even stronger lean towards transition mutations, with an expected ratio of 3.0-3.3. A significant deviation from the expected values could indicate artifactual variants causing bias. If your TiTv ratio is too low, your callset likely has more false positives.
It should also be noted that the TiTv ratio from exome-sequenced data will vary from the expected value depending on the length of the flanking sequences. When we analyze exome sequence data, we add some padding (usually 100 bases) around the targeted regions (using the -ip engine argument) because this improves calling of variants that are at the edges of exons (whether inside the exon sequence or in the promoter/regulatory sequence before the exon). These flanking sequences are not subject to the same evolutionary pressures as the exons themselves, so the numbers of transition and transversion mutations lean away from the expected ratio. The amount of "lean" depends on how long the flanking sequence is.
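The counting behind this ratio can be sketched as follows; the classification table is standard (A<->G and C<->T are transitions), but the example calls are invented for illustration:

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions;
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snps):
    """snps: iterable of (ref, alt) base pairs for biallelic SNPs."""
    snps = list(snps)
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snps)
    tv = len(snps) - ti
    return ti / tv

# Invented example calls: 4 transitions and 2 transversions.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G"), ("C", "T")]
print(titv_ratio(calls))  # 4 / 2 = 2.0
```

Note that a random draw over all 12 possible substitutions would yield 4 transitions to 8 transversions, i.e. the 0.5 baseline mentioned above.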

### Ratio of Insertions to Deletions (Indel Ratio)

This metric is generally evaluated after filtering for purposes that are specific to your study, and the expected value range depends on whether you're looking for rare or common variants, as summarized in the table below.

| Filtering for | Indel Ratio |
| --- | --- |
| common | ~1 |
| rare | 0.2-0.5 |

A significant deviation from the expected ratios listed in the table above could indicate a bias resulting from artifactual variants.
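A minimal sketch of how this ratio is derived from allele lengths; the variant records here are invented for illustration:

```python
# An insertion has a longer ALT than REF allele; a deletion has a shorter one.
def indel_ratio(variants):
    """variants: iterable of (ref, alt) allele strings for indel records."""
    ins  = sum(len(alt) > len(ref) for ref, alt in variants)
    dels = sum(len(alt) < len(ref) for ref, alt in variants)
    return ins / dels

# Invented common-variant callset: insertions and deletions roughly balanced.
common = [("A", "AT"), ("CG", "C"), ("T", "TAA"), ("GAC", "G")]
print(indel_ratio(common))  # 2 insertions / 2 deletions = 1.0
```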

## Tools for performing variant evaluation

### VariantEval

This is the GATK’s main tool for variant evaluation. It is designed to collect and calculate a variety of callset metrics that are organized in evaluation modules, which are listed in the tool doc. For each evaluation module that is enabled, the tool will produce a table containing the corresponding callset metrics based on the specified inputs (your callset of interest and one or more truth sets). By default, VariantEval will run with a specific subset of the available modules (listed below), but all evaluation modules can be enabled or disabled from the command line. We recommend setting the tool to produce only the metrics that you are interested in, because each active module adds to the computational requirements and overall runtime of the tool.

It should be noted that all module calculations only include variants that passed filtering (i.e. FILTER column in your vcf file should read PASS); variants tagged as filtered out will be ignored. It is not possible to modify this behavior. See the [example analysis]() (document in progress) for more details on how to use this tool and interpret its output.

### GenotypeConcordance

This tool calculates -- you’ve guessed it -- the genotype concordance between callsets. In earlier versions of GATK, GenotypeConcordance was itself a module within VariantEval. It was converted into a standalone tool to enable more complex genotype concordance calculations.

### Picard tools

The Picard toolkit includes two tools that perform similar functions to VariantEval and GenotypeConcordance, respectively called CollectVariantCallingMetrics and GenotypeConcordance. Both are relatively lightweight in comparison to their GATK equivalents; their functionalities are more limited, but they do run quite a bit faster. See the [example analysis]() (document in progress) of CollectVariantCallingMetrics for details on its use and data interpretation. Note that in the coming months, the Picard tools are going to be integrated into the next major version of GATK, at which point we plan to consolidate these two pairs of homologous tools to eliminate redundancy.

### Which tool should I use?

We recommend Picard's version of each tool for most cases. The GenotypeConcordance tools provide mostly the same information, but Picard's version is preferred by Broadies. Both VariantEval and CollectVariantCallingMetrics produce similar metrics; however, the latter runs faster and scales better for larger cohorts. By default, CollectVariantCallingMetrics stratifies by sample, allowing you to see the value of relevant statistics as they pertain to specific samples in your cohort. It includes all metrics discussed here, as well as a few more. On the other hand, VariantEval provides many more metrics beyond the minimum described here for analysis. It should be noted that none of these tools use phasing to determine metrics.

So when should I use CollectVariantCallingMetrics?

• If you have a very large callset
• If you want to look at the metrics discussed here and not much else
• If you want your analysis back quickly

When should I use VariantEval?

• When you require a more detailed analysis of your callset
• If you need to stratify your callset by another factor (allele frequency, indel size, etc.)
• If you need to compare to multiple truth sets at the same time

Created 2015-09-29 16:58:59 | Updated 2015-12-01 23:22:47 | Tags: varianteval metrics collectvariantcallingmetrics titv

## Context

This document will walk you through use of GATK's VariantEval tool. VariantEval allows for a lot of customizability, enabling an enhanced analysis of your callset through stratification, use of additional evaluation modules, and the ability to pass in multiple truth sets. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits of each tool, as well as the basis for variant evaluation, are discussed here.

## Example Analysis

java -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R reference.b37.fasta \
-eval SampleVariants.vcf \
-D dbsnp_138.b37.excluding_sites_after_129.vcf \
-noEV -EV CompOverlap -EV IndelSummary -EV TiTvVariantEvaluator -EV CountVariants -EV MultiallelicSummary \
-o SampleVariants_Evaluation.eval.grp

This command specifies the tool (VariantEval), input files, evaluation modules to be used, and an output file to write the results to. The output will be in the form of a GATKReport.

### Input Files

• -eval: a .vcf file containing your sample(s)' variant data you wish to evaluate. The example shown here uses a whole-genome sequenced rare variant association study performed on >1500 samples. You can specify multiple files to evaluate with additional -eval lines.
• -D: a dbSNP .vcf to provide the tool a reference of known variants, which can be found in the GATK bundle
• -R: a reference sequence .fasta

### Evaluation Modules

For our example command, we will simplify our analysis and examine results using the following minimum standard modules: CompOverlap, IndelSummary, TiTvVariantEvaluator, CountVariants, & MultiallelicSummary. These modules will provide a reasonable assessment of variant qualities while reducing the computational burden in comparison to running the default modules. On the data used here (>1500 whole-genome-sequenced samples), this reduced the runtime by 5 hours and 20 minutes compared to using the default modules, which equates to a 12% time reduction. To set this up, all default modules are removed with -noEV, then the minimum standard modules are added back in. Note that the tool uses only variants that have passed all filtration steps to calculate metrics.

• CompOverlap: gives concordance metrics based on the overlap between the evaluation and comparison file
• CountVariants: counts different types (SNP, insertion, complex, etc.) of variants present within your evaluation file and gives related metrics
• IndelSummary: gives metrics related to insertions and deletions (count, multiallelic sites, het-hom ratios, etc.)
• MultiallelicSummary: gives metrics relevant to multiallelic variant sites, including amount, ratio, and TiTv
• TiTvVariantEvaluator: gives the number and ratio of transition and transversion variants for your evaluation file, comparison file, and ancestral alleles
• MetricsCollection: includes all minimum metrics discussed in this article (the one you are currently reading). Runs by default if CompOverlap, IndelSummary, TiTvVariantEvaluator, CountVariants, & MultiallelicSummary are run as well. (included in the nightly build for immediate use or in the 3.5 release of GATK)

### Example Output

Here we see an example of the table generated by the CompOverlap evaluation module. The field concordantRate is highlighted as it is one of the metrics we examine for quality checks. Each table generated by the sample call is in the attached files list at the end of this document, which you are free to browse at your leisure.

It is important to note the stratification by novelty, seen in this and all other tables for this example. The row for "novel" includes all variants that are found in SampleVariants.vcf but not found in the known variants file. By default, your known variants are those in dbSNP. However, if you would like to specify a different set of known variants, you can pass in a -comp file and designate it as known with -knownName. (See the VariantEval tool documentation for more information.) The "known" row includes all variants found in both SampleVariants.vcf and the known variants file. "All" totals the "known" and "novel" rows. This novelty stratification is done by default, but many other stratifications can be specified; see the tool documentation for more information.

Compiled in the table below are all of the metrics, taken from the various output tables, that we will use to ascertain the quality of the analysis.

### Metrics Analysis

• concordantRate Referring to percent concordance, this metric is found in the CompOverlap table. The concordance given here is site-only; for concordance that also checks the genotype, use GenotypeConcordance from Picard or GATK. Your default truth set is dbSNP, though additional truth sets can be passed into VariantEval using the -comp argument. In the case used here, we expect (and observe) a majority overlap between eval and dbSNP. The latter contains a multitude of variants and is not highly regulated, so matching a high number of eval variants to it is quite likely. Please note: as dbSNP is our default truth set (for comparison) and our default known (for novelty determination), you will see 0 in the novel concordantRate column. If you are interested in knowing the novel concordantRate, you must specify a truth set different from the set specified as known.

• nSNPs/n_SNPs & nIndels/n_indels The number of SNPs is given in CountVariants, MultiallelicSummary, and IndelSummary; the number of indels is given in MultiallelicSummary and IndelSummary. Different numbers are seen in each table for the same metric due to the way in which each table calculates it. Consider an example with four samples: each sample gives its two major alleles, and though all samples have a variant at this particular locus, each call is slightly different, making this a multiallelic site.
IndelSummary counts all variants separately at a multiallelic site; it thus counts 2 SNPs (one T and one C) and 1 indel (a deletion) across all samples. CountVariants and MultiallelicSummary, on the other hand, count a multiallelic site as a single variant, while still counting indels and SNPs as separate variants. Thus, they count one indel and one SNP. If you were to stratify by sample, all the tables would agree on the numbers for samples 1, 2, & 4, as theirs are biallelic sites. Sample 3 is multiallelic, so IndelSummary would count 2 SNPs, whereas CountVariants and MultiallelicSummary would count 1 SNP. Though shown here on a very small scale, the same process occurs when analyzing a whole genome or exome of variants.
Our resulting numbers (~56 million SNPs & ~8-11 million indels) are for a cohort of >1500 whole-genome sequencing samples. Therefore, although the numbers are quite large in comparison to the ~4.4 million average variants found in Nature's 2015 paper, they are within reason for a large cohort of whole genome samples.

• Indel Ratio The indel ratio appears twice in our tables: as insertion_to_deletion_ratio under IndelSummary, and as insertionDeletionRatio under CountVariants. Each table gives a different ratio, due to the differences in counting indels discussed in the previous section. In our particular sample data set, filters were run to favor detection of rare variants. Thus the indel ratios given by IndelSummary (0.77 & 0.69) are closer to the expected ratio for rare variants than to the ~1 expected for common variants.

• tiTvRatio While the MultiallelicSummary table gives a value for the TiTv of multiallelic sites, we are more interested in the overall TiTv, given by the TiTvVariantEvaluator. The values seen here (2.10-2.19) are at the higher edge of the expected range (2.0-2.1), but are still within reason.
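The two counting conventions described under nSNPs/nIndels above can be sketched as follows; the specific alleles are assumptions chosen to mimic a multiallelic site, not actual data:

```python
# A hypothetical multiallelic site: REF=AT with three alternate alleles.
# "T" is a one-base deletion; "CT" and "GT" are SNPs at the first base.
ref = "AT"
alts = ["T", "CT", "GT"]

# IndelSummary-style: count every alternate allele separately.
snps_per_allele   = sum(len(a) == len(ref) for a in alts)  # 2 SNP alleles
indels_per_allele = sum(len(a) != len(ref) for a in alts)  # 1 deletion allele

# CountVariants/MultiallelicSummary-style: the site counts once per variant type.
snps_per_site   = min(snps_per_allele, 1)
indels_per_site = min(indels_per_allele, 1)

print(snps_per_allele, indels_per_allele)  # 2 1
print(snps_per_site, indels_per_site)      # 1 1
```

This is why the per-allele tables report higher counts than the per-site tables on the same callset.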

## Note on speed performance

The purpose of running the analysis with the minimum standard evaluation modules is to minimize the time spent running VariantEval. Reducing the number of evaluation modules has some effects on the total runtime; depending on the additional specifications given (stratifications, multiple -comp files, etc.), running with the minimum standard evaluation modules can reduce the runtime by 10-30% in comparison to running the default evaluation modules. Further reducing the runtime can be achieved through multithreading, using the -nt argument.

Created 2015-09-23 12:06:10 | Updated 2015-12-01 23:20:26 | Tags: varianteval genotypeconcordance metrics collectvariantcallingmetrics titv

## Context

This document will walk you through use of Picard's CollectVariantCallingMetrics tool, an excellent tool for large callsets, especially if you need your results quickly and do not require many metrics in addition to those described here. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits of each tool, as well as the basis for variant evaluation, are discussed here.

## Example Use

### Command

java -jar picard.jar CollectVariantCallingMetrics \
INPUT=CEUtrio.vcf \
OUTPUT=CEUtrioMetrics \
DBSNP=dbsnp_138.b37.excluding_sites_after_129.vcf 
• INPUT The CEU trio (NA12892, NA12891, and NA12878) from the 1000 Genomes Project is the input chosen for this example. It is the callset whose metrics we wish to examine, and thus this is the field where you would specify the .vcf file containing your sample(s)' variant calls.
• OUTPUT The output for this command will be written to two files named CEUtrioMetrics.variant_calling_summary_metrics and CEUtrioMetrics.variant_calling_detail_metrics, hereafter referred to as "summary" and "detail", respectively. The name given in this field is used as the base name of the output files; the file extensions are added by the tool itself.
• DBSNP The last required input to run this tool is a dbSNP file. The one used here is available in the current GATK bundle. CollectVariantCallingMetrics utilizes this dbSNP file as a base of comparison against the sample(s) present in your vcf.

### Getting Results

After running the command, CollectVariantCallingMetrics will return both a detail and a summary metrics file. These files can be viewed as plain text if needed, or read in as tables using your preferred spreadsheet viewer (e.g. Excel) or scripting language of your choice (e.g. Python, R, etc.). The files contain headers and are tab-delimited; the commands for reading the tables into RStudio are shown below. (Note: replace "~/path/to/" with the path to your output files as needed.)

summary <- read.table("~/path/to/CEUtrioMetrics.variant_calling_summary_metrics", header=TRUE, sep="\t")
detail <- read.table("~/path/to/CEUtrioMetrics.variant_calling_detail_metrics", header=TRUE, sep="\t")
• Summary The summary metrics file will contain a single row of data for each metric, taking into account all samples present in your INPUT file.
• Detail The detail metrics file gives a breakdown of each statistic by sample. In addition to all metrics covered in the summary table, the detail table also contains entries for SAMPLE_ALIAS and HET_HOMVAR_RATIO. In the example case here, the detail file will contain metrics for the three different samples, NA12892, NA12891, and NA12878.
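If you prefer Python to R, a minimal stdlib-only sketch is shown below. The file content here is a tiny mocked-up stand-in for a real detail file (real Picard metrics files have many more columns); the key point is that '#'-prefixed comment lines must be skipped before the tab-delimited header:

```python
import csv
import io

# Mocked-up stand-in for CEUtrioMetrics.variant_calling_detail_metrics.
mock_detail = "\n".join([
    "## METRICS CLASS\tCollectVariantCallingMetrics.VariantCallingDetailMetrics",
    "SAMPLE_ALIAS\tTOTAL_SNPS\tNOVEL_SNPS",
    "NA12878\t3655456\t1244",
])

# Drop comment lines, then zip the header with each data row.
rows = [r for r in csv.reader(io.StringIO(mock_detail), delimiter="\t")
        if r and not r[0].startswith("#")]
header, data = rows[0], rows[1:]
detail = [dict(zip(header, row)) for row in data]

print(detail[0]["SAMPLE_ALIAS"])  # NA12878
```

For a real file, replace io.StringIO(mock_detail) with open("~/path/to/CEUtrioMetrics.variant_calling_detail_metrics") (path expanded as needed); values are read as strings and should be cast before numeric comparison.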

### Analyzing Results

*Concatenated in the above table are the detail file's (rows 1-3) and the summary file's (row 4) relevant metrics; for full output table, see attached image file.

• Number of Indels & SNPs This tool collects the number of SNPs (single nucleotide polymorphisms) and indels (insertions and deletions) as found in the variants file. It counts only biallelic sites and filters out multiallelic sites. Many factors affect these counts, including cohort size, relatedness between samples, strictness of filtering, ethnicity of samples, and even algorithm improvement due to updated software. While this metric alone is insufficient to evaluate your variants, it does provide a good baseline. It is reassuring to see that across the three related samples, we saw very similar numbers of SNPs and indels. It could be cause for concern if a particular sample had significantly more or fewer variants than the rest.
• Indel Ratio The indel ratio is the total number of insertions divided by the total number of deletions; this tool does not include filtered variants in this calculation. Usually the indel ratio is around 1, as insertions typically occur about as frequently as deletions. In rare variant studies, however, the indel ratio should be around 0.2-0.5. Our samples have an indel ratio of ~0.95, indicating that these variants are not likely to have a bias affecting their insertion/deletion ratio.
• TiTv Ratio This metric is the ratio of transition (Ti) to transversion (Tv) mutations. For whole genome sequencing data, the TiTv ratio should be ~2.0-2.1, whereas whole exome sequencing data will have a TiTv ratio of ~3.0-3.3. In the case of the CEU trio, the TiTv values of ~2.08 and ~1.91 are within reason, and this variant callset is unlikely to have a bias affecting its transition/transversion ratio.

Created 2013-03-18 20:25:42 | Updated 2013-03-18 20:26:03 | Tags: varianteval intermediate tooltips

VariantEval accepts two types of modules: stratification and evaluation modules.

• Stratification modules will stratify (group) the variants based on certain properties.
• Evaluation modules will compute certain metrics for the variants.

### CpG

CpG is a three-state stratification:

• The locus is a CpG site ("CpG")
• The locus is not a CpG site ("non_CpG")
• The locus is either a CpG or not a CpG site ("all")

A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
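A literal translation of this definition into code might look like the following sketch (forward strand only, exactly as defined above; the example sequence is invented):

```python
def is_cpg_site(reference, pos):
    """True if the reference base at 0-based position pos is a C
    immediately followed by a G in the 3' direction."""
    return (reference[pos] == "C"
            and pos + 1 < len(reference)
            and reference[pos + 1] == "G")

ref = "TTACGAT"
print(is_cpg_site(ref, 3))  # True: the C at position 3 is followed by a G
print(is_cpg_site(ref, 6))  # False: position 6 is a T
```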

### EvalRod

EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.

### Sample

Sample is an N-state stratification, where N is the number of samples in the eval files.

### Filter

Filter is a three-state stratification:

• The locus passes QC filters ("called")
• The locus fails QC filters ("filtered")
• The locus either passes or fails QC filters ("raw")

### FunctionalClass

FunctionalClass is a four-state stratification:

• The locus is a synonymous site ("silent")
• The locus is a missense site ("missense")
• The locus is a nonsense site ("nonsense")
• The locus is of any functional class ("any")

### CompRod

CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.

### Degeneracy

Degeneracy is a six-state stratification:

• The underlying base position in the codon is 1-fold degenerate ("1-fold")
• The underlying base position in the codon is 2-fold degenerate ("2-fold")
• The underlying base position in the codon is 3-fold degenerate ("3-fold")
• The underlying base position in the codon is 4-fold degenerate ("4-fold")
• The underlying base position in the codon is 6-fold degenerate ("6-fold")
• The underlying base position in the codon is degenerate at any level ("all")

### JexlExpression

JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See [[Using JEXL expressions]]

### Novelty

Novelty is a three-state stratification:

• The locus overlaps the knowns comp track (usually the dbSNP track) ("known")
• The locus does not overlap the knowns comp track ("novel")
• The locus either overlaps or does not overlap the knowns comp track ("all")

### CountVariants

CountVariants is an evaluation module that computes the following metrics:

| Metric | Definition |
| --- | --- |
| nProcessedLoci | Number of processed loci |
| nCalledLoci | Number of called loci |
| nRefLoci | Number of reference loci |
| nVariantLoci | Number of variant loci |
| variantRate | Variants per-locus rate |
| variantRatePerBp | Number of variants per base |
| nSNPs | Number of SNP loci |
| nInsertions | Number of insertions |
| nDeletions | Number of deletions |
| nComplex | Number of complex loci |
| nNoCalls | Number of no-call loci |
| nHets | Number of het loci |
| nHomRef | Number of hom-ref loci |
| nHomVar | Number of hom-var loci |
| nSingletons | Number of singletons |
| heterozygosity | Heterozygosity per-locus rate |
| heterozygosityPerBp | Heterozygosity per base pair |
| hetHomRatio | Heterozygosity to homozygosity ratio |
| indelRate | Indel rate (insertion count + deletion count) |
| indelRatePerBp | Indel rate per base pair |
| deletionInsertionRatio | Deletion to insertion ratio |

### CompOverlap

CompOverlap is an evaluation module that computes the following metrics:

| Metric | Definition |
| --- | --- |
| nEvalSNPs | Number of eval SNP sites |
| nCompSNPs | Number of comp SNP sites |
| novelSites | Number of eval sites outside of comp sites |
| nVariantsAtComp | Number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same) |
| compRate | Percentage of eval sites at comp sites |
| nConcordant | Number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele) |
| concordantRate | The concordance rate |

#### Understanding the output of CompOverlap

A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):

 $ grep -v '##' dbsnp.vcf
#CHROM  POS     ID              REF     ALT     QUAL    FILTER  INFO
1       10327   rs112750067     T       C       .       .       ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132

Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:

 $ grep -v '##' eval_correct_allele.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
1       10327   .       T       C       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:

 $ grep -v '##' eval_incorrect_allele.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
1       10327   .       T       A       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Running VariantEval with just the CompOverlap module:

 $ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
    -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
    -L 1:10327 \
    -B:dbsnp,VCF dbsnp.vcf \
    -B:eval_correct_allele,VCF eval_correct_allele.vcf \
    -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
    -noEV \
    -EV CompOverlap \
    -o eval.table

We find that the eval.table file contains the following:

 $ grep -v '##' eval.table | column -t
CompOverlap  CompRod  EvalRod                JexlExpression  Novelty  nEvalVariants  nCompVariants  novelSites  nVariantsAtComp  compRate      nConcordant  concordantRate
CompOverlap  dbsnp    eval_correct_allele    none            all      1              1              0           1                100.00000000  1            100.00000000
CompOverlap  dbsnp    eval_correct_allele    none            known    1              1              0           1                100.00000000  1            100.00000000
CompOverlap  dbsnp    eval_correct_allele    none            novel    0              0              0           0                0.00000000    0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            all      1              1              0           1                100.00000000  0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            known    1              1              0           1                100.00000000  0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            novel    0              0              0           0                0.00000000    0            0.00000000

As you can see, the detection set variant is counted under nVariantsAtComp in both cases (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because only there does the alternate allele match the dbSNP entry.
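
The classification logic in this example can be sketched as follows. This is a simplified illustration for biallelic SNPs, not GATK's actual code, and `classify_site` is a hypothetical helper name:

```python
def classify_site(eval_alt, comp_alts):
    """Classify an eval SNP against the comp (e.g. dbSNP) record at its locus.

    comp_alts: set of ALT alleles in the comp track at this position,
    or None if the position is absent from the comp track.
    """
    if comp_alts is None:
        return "novel"        # counted in novelSites
    if eval_alt in comp_alts:
        return "concordant"   # counted in nVariantsAtComp and nConcordant
    return "discordant"       # counted in nVariantsAtComp only
```

Applied to the example above: the eval_correct_allele call (ALT=C against dbSNP ALT=C) is concordant, while the eval_incorrect_allele call (ALT=A) shares the locus but not the allele.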

### TiTvVariantEvaluator

TiTvVariantEvaluator is an evaluation module that computes the following metrics:

| Metric | Definition |
|---|---|
| nTi | Number of transition loci |
| nTv | Number of transversion loci |
| tiTvRatio | The transition to transversion ratio |
| nTiInComp | Number of comp transition sites |
| nTvInComp | Number of comp transversion sites |
| TiTvRatioStandard | The transition to transversion ratio for comp sites |
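
Transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) substitutions; everything else is a transversion. A small sketch of the classification, assuming biallelic SNPs (the helper names are hypothetical, for illustration only):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G and C<->T substitutions, False for transversions."""
    return (ref in PURINES and alt in PURINES) or \
           (ref in PYRIMIDINES and alt in PYRIMIDINES)

def titv_ratio(snps):
    """Compute nTi / nTv from an iterable of (REF, ALT) pairs."""
    snps = list(snps)
    n_ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    n_tv = len(snps) - n_ti
    return n_ti / n_tv if n_tv else float("nan")
```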

Created 2012-08-20 18:52:48 | Updated 2012-08-23 14:11:29 | Tags: unifiedgenotyper baserecalibrator combinevariants haplotypecaller selectvariants varianteval release-notes

## Base Quality Score Recalibration

• Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
• Implemented support for SOLiD no call strategies other than throwing an exception.
• Fixed smoothing in the BQSR bins.
• Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

## Unified Genotyper

• Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
• UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
• Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
• In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
• Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

## Haplotype Caller

• Added LowQual filter to the output when appropriate.
• Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
• Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
• Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
• Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
• Fixed bug where non-standard bases from the reference would cause errors.
• Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

• Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
• Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
• Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

## Variant Eval

• Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
• Fixed incorrect allele counting in IndelSummary evaluation.

## Combine Variants

• Now outputs the first non-MISSING QUAL, instead of the maximum.
• Now supports multi-threaded running (with the -nt argument).

## Select Variants

• Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
• No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
• If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).

## Miscellaneous

• GATK now generates a proper error when a gzipped FASTA is passed in.
• Various improvements throughout the BCF2-related code.
• Removed various parallelism bottlenecks in the GATK.
• Added support of X and = CIGAR operators to the GATK.
• Catch NumberFormatExceptions when parsing the VCF POS field.
• Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
• Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
• We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
• Added support for handling complex events in ValidateVariants.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Created 2015-11-25 19:52:15 | Updated | Tags: varianteval interpretation

I used the "VariantEval" module to evaluate 33 of my samples against a reference dataset across 23 chromosomes. I expected that nHomVar + nHets = nSNPs, but that is not the case! I don't know why the number of nHets is so large. Any thoughts or clues? Thanks

allSample  nVariantLoci  nSNPs  nInsertions  nDeletions  nHets   nHomRef  nHomVar  nSingletons  hetHomRatio
all_chr1   51144         47364  1719         2061        152157  672      33997    22615        4.48
all_chr2   37491         34738  1246         1507        102477  257      19140    18172        5.35
all_chr3   35139         32591  1163         1385        98830   386      18342    16502        5.39
all_chr4   26399         24419  864          1116        71321   198      14864    12911        4.80
all_chr5   28058         25933  958          1167        76321   290      14099    13391        5.41
all_chr6   31222         28940  1009         1273        82483   243      16410    15084        5.03
all_chr7   28351         26331  907          1113        80464   221      14682    13448        5.48
all_chr8   22551         20968  701          882         63569   224      13134    10650        4.84
all_chr9   22478         20987  679          812         60268   198      11341    11055        5.31
all_chr10  27740         25691  924          1125        79158   182      15035    12875        5.26
all_chr11  24012         22309  762          941         66762   197      14378    11521        4.64
all_chr12  25239         23310  832          1097        71793   218      15187    11800        4.73
all_chr13  12494         11547  388          559         34506   113      7156     5945         4.82
all_chr14  15825         14640  534          651         42887   123      9137     7689         4.69
all_chr15  17554         16249  585          720         49245   133      10203    8329         4.83
all_chr16  22966         21529  656          781         63713   165      13000    10825        4.90
all_chr17  23983         22285  768          930         71715   236      15978    10776        4.49
all_chr18  9741          8935   374          432         25755   93       4743     4793         5.43
all_chr19  26262         24448  792          1022        81650   359      17700    11176        4.61
all_chr20  13068         12133  431          504         37924   155      7633     6197         4.97
all_chr21  7722          7220   212          290         20928   124      4497     3807         4.65
all_chr22  14151         13188  406          557         42289   156      8814     6180         4.80
all_chr23  5330          4869   221          240         10552   92       5177     2772         2.04

Created 2015-07-16 18:45:45 | Updated | Tags: varianteval sensitivity specificity

I have 33 samples (VCF files) that resulted from HaplotypeCaller, and I would like to evaluate them using dbSNP as a comp to get the concordance rate, sensitivity, and specificity. One way is to evaluate each sample separately. Another approach is to combine all 33 files using "CombineGVCFs" and then evaluate one VCF file. Which of these two methods do you think is feasible and acceptable for variant evaluation, and why? Thanks

Created 2015-07-01 19:10:26 | Updated | Tags: varianteval specificity

In the following article you explained sensitivity, discrepancy rate, and concordance rate. What about the specificity rate?

Could you please point me to an article or reference that explains each of the parameters measured in the output of "VariantEval"? Thanks

Created 2015-04-14 09:02:46 | Updated | Tags: varianteval

VariantEval provides many useful statistics for evaluating callsets. However, the documentation I could find on the GATK website was somewhat out of date, so some columns are not easy to understand or guess. I have two questions:

1. What are the meanings of nMNPs, nComplex, nSymbolic, nMixed, and nHomDerived in a CountVariants table of a VariantEval report?
2. Where can I find formulas or explanations for all the calculations in VariantEval?

Thanks a lot! Hiu

Created 2015-04-10 16:14:16 | Updated | Tags: varianteval

How can I get more information about the output of the VariantEval tool? Is there any document or article with more details, particularly about specificity, sensitivity, TiTvRatio, hetHomRatio, etc.? Thanks.

Created 2015-01-27 20:07:05 | Updated 2015-01-27 20:09:02 | Tags: varianteval r

VariantEval follows a strict format that is human readable but also the top few lines of each table more than hint that it's ready to be parsed by something else. Been wondering what that is, looks like Python... Is there a nice quick way to load all of these data into something like R or Python? I can imagine using grep to put each table into a file and load that into R but if the output was designed for a certain tool, it would be great to use that.

Thanks as always for the wonderful tools!

Example, what language or tool is this stuff for?

#:GATKTable:11:12:%s:%s:%s:%s:%s:%d:%d:%d:%.2f:%d:%.2f:;

#:GATKTable:CompOverlap:The overlap between eval and comp sites
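
As far as we can tell, those `#:GATKTable:` lines are GATKReport metadata (table dimensions plus printf-style column formats) rather than syntax for one specific downstream tool; the body of each table is plain whitespace-aligned text. A hedged sketch for pulling one such table into Python, assuming the layout shown in the examples on this page (`parse_gatk_table` is a hypothetical helper, not part of GATK):

```python
def parse_gatk_table(lines):
    """Turn one whitespace-aligned GATKReport table into a list of dicts.

    Assumes: '#:GATKTable:' metadata lines, then a header row, then one
    row per record, all columns whitespace-delimited with no embedded spaces.
    """
    rows = [line.split() for line in lines
            if line.strip() and not line.startswith("#")]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]
```

In R, something like `read.table(file, header = TRUE, comment.char = "#")` on a single extracted table should achieve much the same thing.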

Created 2014-07-28 22:08:52 | Updated | Tags: combinevariants varianteval ignorefilter

I am running CombineVariants followed by VariantEval on two VCF files (GATK version 3.1-1). I noticed that none of my "set=FilteredInAll" variants are included in the output of VariantEval. I suspect this is because the FILTER column is not "PASS" because the variants were not "PASS" in either of the input VCF files.

Is there an argument to pass into VariantEval so that it does not filter variants out based on the FILTER column? I was using the arguments from a guide article (http://gatkforums.broadinstitute.org/discussion/48/using-varianteval). I saw that in the comments you mentioned an --ignoreFilters argument, but when I tried to use it I got the GATK error: Argument with name 'ignoreFilters' isn't defined.

I can just change the FILTER column entries in the combined VCF file, but it seems like a problem that others might have as well.

Thank you!

Created 2014-07-11 11:46:43 | Updated | Tags: varianteval stratification

Using the following command line:

java -Xmx2g -jar /media/data/Applications/GenomeAnalysisTK.jar \
    -R /media/data/GATK/hg19/ucsc.hg19.fasta \
    -T VariantEval \
    -noEV -ST FunctionalClass \
    -o eval759_FunctionalClassreport \
    --eval Sample_759.filtered.vcf \
    -D /media/data/GATK/hg19/dbsnp_138.hg19.vcf

the tables show no data for missense, nonsense, etc. For example:

#:GATKReport.v1.1:8
#:GATKTable:12:12:%s:%s:%s:%s:%s:%s:%d:%d:%d:%.2f:%d:%.2f:;
#:GATKTable:CompOverlap:The overlap between eval and comp sites
CompOverlap  CompRod  EvalRod  FunctionalClass  JexlExpression  Novelty  nEvalVariants  novelSites  nVariantsAtComp  compRate  nConcordant  concordantRate
CompOverlap  dbsnp    eval     all              none            all      2850055        55006       2795049          98.07     2531746      90.58
CompOverlap  dbsnp    eval     all              none            known    2795049        0           2795049          100       2531746      90.58
CompOverlap  dbsnp    eval     all              none            novel    55006          55006       0                0         0            0
CompOverlap  dbsnp    eval     missense         none            all      0              0           0                0         0            0
CompOverlap  dbsnp    eval     missense         none            known    0              0           0                0         0            0
CompOverlap  dbsnp    eval     missense         none            novel    0              0           0                0         0            0
CompOverlap  dbsnp    eval     nonsense         none            all      0              0           0                0         0            0
CompOverlap  dbsnp    eval     nonsense         none            known    0              0           0                0         0            0
CompOverlap  dbsnp    eval     nonsense         none            novel    0              0           0                0         0            0
CompOverlap  dbsnp    eval     silent           none            all      0              0           0                0         0            0
CompOverlap  dbsnp    eval     silent           none            known    0              0           0                0         0            0
CompOverlap  dbsnp    eval     silent           none            novel    0              0           0                0         0            0

This is my first foray into GATK, but I thought everything in the command was OK. Is there something wrong with the selection of the specific dbSNP file or reference, such that the SNPs are not being split into classes? Many thanks.

Created 2014-04-17 16:46:28 | Updated | Tags: varianteval

Hi, I have genotype calls on an exome sequencing dataset, and have used the GATK TiTvVariantEvaluator tool to estimate TiTv ratios for my samples. I'm however having trouble understanding the "TiTvRatioStandard" column and how to interpret it. My TiTv ratio for known variants is 2.56, below the generally accepted cutoff of 2.8. The TiTv ratio I get for novel variants is also very low (~1.25), which should raise a red flag, but the value in the "TiTvRatioStandard" column for novel variants is 1.04, and for known variants it is 2.56. I haven't been able to understand what these values mean, and whether I should use them as threshold/quality measures. If someone could point me to more information on this, it would be very helpful. Thanks,

Created 2013-11-20 15:52:46 | Updated | Tags: varianteval stratification

I'd like to be able to perform stratifications in a multi-sample vcf by values that are in the FORMAT fields. Almost all of the existing stratifications are based on site-specific information rather than sample-specific values. One stratification in particular that I would like to perform is by ReadDepth: I would like to be able to select, for instance, all samples with ReadDepth greater than 20. This works in single-sample vcfs, but it produces strange results in ones with multiple samples, since each VariantContext contains multiple genotypes. Melting my vcfs and reporting multiple lines for each position seems possible, but ugly. Splitting vcfs so that each sample is in its own vcf is also possible, and ugly. What is the recommended method for dealing with this sort of stratification?

Created 2013-08-28 21:25:28 | Updated 2013-08-28 21:41:49 | Tags: validatevariants varianteval vcf strelka

Strelka produces vcf files that GATK has issues with. The files pass vcftools validation, which according to the docs is the official validation, but they do not pass ValidateVariants, and VariantEval can't read them either. I'm unsure where the bug lives.
vcf file looks like this:

##fileformat=VCFv4.1
##fileDate=20130801
##source=strelka
##source_version=2.0.8
##startTime=Thu Aug 1 15:23:54 2013
##reference=file:///xchip/cga_home/louisb/reference/human_g1k_v37_decoy.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_007605,length=171823>
##contig=<ID=hs37d5,length=35477943>
##content=strelka somatic indel calls
##germlineIndelTheta=0.0001
##priorSomaticIndelRate=1e-06
##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal">
##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI">
##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT">
##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele">
##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele">
##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interupted homopolymer length intersecting with the indel">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1">
##FORMAT=<ID=DP2,Number=1,Type=Integer,Description="Read depth for tier2">
##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2">
##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2">
##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2">
##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases">
##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 bases">
##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases">
##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence">
##FILTER=<ID=iHpol,Description="Indel overlaps an interupted homopolymer longer than 14x in the reference sequence">
##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3">
##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30">
##cmdline=/xchip/cga_home/louisb/Strelka/strelka_workflow_1.0.7/libexec/consolidateResults.pl --config=/xchip/cga/benchmark/testing/full-run/somatic-benchmark/spiked/Strelka_NDEFGHI_T12345678_0.8/config/run.config.ini
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
1       797126  .       GTAAT   G       .       PASS    IC=1;IHP=2;NT=ref;QSI=56;QSI_NT=56;RC=2;RU=TAAT;SGT=ref->het;SOMATIC;TQSI=1;TQSI_NT=1   DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   47:47:48,49:0,0:3,3:48.72:0.00:0.00     62:62:36,39:17,19:9,9:42.49:0.21:0.00

The output I get from ValidateVariants is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T ValidateVariants --variant strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO 17:19:45,289 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:19:45,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO 17:19:45,291 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:19:45,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:19:45,295 HelpFormatter - Program Args: -T ValidateVariants --variant strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO 17:19:45,295 HelpFormatter - Date/Time: 2013/08/28 17:19:45
INFO 17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:19:45,300 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO 17:19:45,412 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:19:45,513 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:19:45,533 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO 17:19:45,615 GenomeAnalysisEngine - Preparing for traversal
INFO 17:19:45,627 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:19:45,627 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:19:45,627 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 17:19:46,216 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File /Users/louisb/Workspace/strelkaVcfDebug/strelka1.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 1:797126 are not observed at all in the sample genotypes
##### ERROR ------------------------------------------------------------------------------------------

The output from VariantEval is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T VariantEval --eval strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO 17:15:44,333 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:15:44,335 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO 17:15:44,335 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:15:44,335 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:15:44,339 HelpFormatter - Program Args: -T VariantEval --eval strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO 17:15:44,339 HelpFormatter - Date/Time: 2013/08/28 17:15:44
INFO 17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:15:44,349 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO 17:15:44,476 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:15:44,603 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:15:44,623 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO 17:15:44,710 GenomeAnalysisEngine - Preparing for traversal
INFO 17:15:44,722 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:15:44,722 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:15:44,723 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 17:15:44,831 VariantEval - Creating 3 combinatorial stratification states
INFO 17:15:45,382 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
        at org.broadinstitute.sting.gatk.walkers.varianteval.evaluators.CountVariants.update1(CountVariants.java:201)
        at org.broadinstitute.sting.gatk.walkers.varianteval.util.EvaluationContext.apply(EvaluationContext.java:88)
        at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:455)
        at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:124)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR
##### ERROR MESSAGE: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
##### ERROR ------------------------------------------------------------------------------------------

Created 2013-07-26 21:27:21 | Updated | Tags: varianteval

Hi GATK Team,

I am a heavy user of the VariantEval evaluators (particularly GenotypeConcordance) and have tracked some unexpected results to the Sample stratification. What is the motivation for not stratifying the comp RODs by sample? This seems to be a very conscious choice, so I was hoping to understand the background of that decision.

The relevant section of VariantEval.java is:

for ( final RodBinding<VariantContext> compRod : comps ) {
// no sample stratification for comps
final HashMap<String, Collection<VariantContext>> compSetHash = compVCs.get(compRod);
final Collection<VariantContext> compSet = (compSetHash == null || compSetHash.size() == 0) ? Collections.<VariantContext>emptyList() : compVCs.get(compRod).values().iterator().next();

The effect for me is that many spurious genotypes get included in cases where there is a comp variant but no eval variant.

Thanks!

Created 2013-06-21 21:44:03 | Updated | Tags: haplotypecaller varianteval gatkreport

Hi,

I just finished running a fairly large number of WGS samples through HaplotypeCaller and I've been using VariantEval to look at some summary stats on these samples. I've noticed that under '#:GATKTable:VariantSummary:1000 Genomes Phase I summary of variants table' there's a section on structural variations and that apparently I'm getting about 3500 in one of my samples. Here's the actual section of the table in question:

#:GATKTable:20:3:%s:%s:%s:%s:%s:%d:%d:%d:%.2f:%s:%d:%.2f:%.1f:%d:%s:%d:%.1f:%d:%s:%d:;
#:GATKTable:VariantSummary:1000 Genomes Phase I summary of variants table
VariantSummary  CompRod  EvalRod  JexlExpression  Novelty  nSamples  nProcessedLoci  nSNPs    TiTvRatio  SNPNoveltyRate  nSNPsPerSample  TiTvRatioPerSample  SNPDPPerSample  nIndels  IndelNoveltyRate  nIndelsPerSample  IndelDPPerSample  nSVs  SVNoveltyRate  nSVsPerSample
VariantSummary  dbsnp    vcf1     none            all             1      3095693981  3446166       2.08            1.34         3446166                2.08             0.0   962028             15.33            962028               0.0  3282          73.58           3282
VariantSummary  dbsnp    vcf1     none            known           1      3095693981  3399907       2.08            0.00         3399907                2.08             0.0   814506              0.00            814506               0.0   867           0.00            867
VariantSummary  dbsnp    vcf1     none            novel           1      3095693981    46259       1.71          100.00           46259                1.71             0.0   147522            100.00            147522               0.0  2415         100.00           2415

I didn't think that HaplotypeCaller even looked for structural variations, so I tried to find these structural variations in the VCF, hoping they were encoded as described here, but I couldn't find anything. Could someone tell me why VariantEval is reporting a number of structural variations when I can't find any in the actual VCF? Does VariantEval just interpret a sufficiently large indel as an SV? If so, I can understand why it may report some structural variations, considering there are indels longer than 1 kb among the indels in this sample.

Thanks,

Grant

Created 2013-06-10 21:09:29 | Updated | Tags: varianteval genotypeconcordance stratification

I would like to evaluate variant calls to produce a plot (pseudo-ROC) of sensitivity vs. specificity (or concordance, etc.) when I condition on a minimum/maximum value of a particular metric (coverage, genotype quality, etc.). I can do this by running VariantEval or GenotypeConcordance multiple times, once for each cutoff value, but this is inefficient, since I believe I should be able to compute these values in one pass. Alternatively, if there were a simple tool to annotate each variant as concordant or discordant, I could tabulate the results myself. I would like to rely on GATK's variant comparison logic to compare variants (especially indels). Any thoughts on whether the current tools can be parameterized, or adapted, for these purposes?
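The one-pass tabulation the poster describes is straightforward once each call is annotated concordant/discordant: bucket each call by its metric value once, then take suffix sums over the buckets to get counts at every cutoff. A sketch, with invented GQ thresholds and call records:

```python
# Sketch of a single-pass threshold sweep: bucket each call once, then
# accumulate suffix sums to get concordance at every cutoff. The thresholds
# and (metric, is_concordant) records are made up for illustration.
from bisect import bisect_right

thresholds = [10, 20, 30, 40]   # hypothetical GQ cutoffs
calls = [(5, False), (15, True), (25, True), (35, False), (45, True)]

n = len(thresholds)
conc = [0] * (n + 1)   # bucket i: calls passing exactly i cutoffs
disc = [0] * (n + 1)
for gq, ok in calls:
    idx = bisect_right(thresholds, gq)   # number of cutoffs this call passes
    (conc if ok else disc)[idx] += 1

# suffix sums: calls with GQ >= thresholds[i] live in buckets i+1..n
for i in range(n):
    passing_conc = sum(conc[i + 1:])
    passing_disc = sum(disc[i + 1:])
    total = passing_conc + passing_disc
    rate = passing_conc / total if total else float("nan")
    print(f"GQ >= {thresholds[i]}: {passing_conc} concordant, "
          f"{passing_disc} discordant, concordance {rate:.2f}")
```

This visits each call once regardless of how many cutoffs are evaluated, which is the efficiency gain the poster is after.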

N

Created 2013-04-04 08:39:18 | Updated | Tags: varianteval

Hi,

I have processed 10 whole-exome samples using the GATK best practices workflow (GATK v2.4-3-g2a7af43). I am currently evaluating my variant call set (generated from HaplotypeCaller) with OMNI 2.5 SNP array (comparison set) and dbSNP 137.

I have included 2 rows from the Ti/Tv Variant Evaluator table:

    CompRod  EvalRod  Novelty  Sample  nTi    nTv    tiTvRatio  nTiInComp  nTvInComp  TiTvRatioStandard
    OMNI     MyCalls  all      all     79945  30322  2.64       993588     274219     3.62
    dbsnp    MyCalls  all      all     79945  30322  2.64       30214009   15253850   1.98

According to my literature survey, the Ti/Tv ratio should be approximately 2.1 for whole-genome sequencing and 2.8 for whole-exome sequencing. Since I am getting a Ti/Tv of 2.64 for my exome data, does this indicate false positives in the data? Also, what could be the rationale for getting such a high TiTvRatioStandard for the OMNI whole-genome data?
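For reference, the Ti/Tv ratio in question is a simple count ratio: transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) substitutions, everything else is a transversion. A minimal sketch (the SNP list is invented):

```python
# Minimal sketch of a Ti/Tv computation. Transitions: A<->G and C<->T;
# all other single-base substitutions are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs."""
    snps = list(snps)
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
print(titv_ratio(snps))  # 3 transitions / 1 transversion = 3.0
```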

Thanks!

Created 2013-01-24 22:13:52 | Updated | Tags: varianteval jexl

Hello,

I'm currently running variantEval to count up variants per individual stratified by a variety of annotations.

My GATK call looks like:

    java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar \
        -T VariantEval \
        -R Homo_sapiens_assembly19.fasta \
        -o output.txt \
        -L input.vcf \
        -eval input.vcf \
        -ST Sample -noST \
        -noEV -EV CountVariants \
        -ST JexlExpression --select_names "nonsynon" --select_exps "resource.VAT_CDS == 'nonsynonymous' && resource.FOUNDERS_FRQ > 0.05" \
        -ST JexlExpression --select_names "synon" --select_exps "resource.VAT_CDS == 'synonymous' && resource.FOUNDERS_FRQ > 0.05" ...

where the VAT_CDS section of the INFO field in the VCF has a functional annotation or is set to "na" if an annotation is unavailable. I'm getting the following error:

##### ERROR ------------------------------------------------------------------------------------------

but weirdly the error is not consistent (my data is stratified by chromosome, and most chromosomes will run without throwing the error while one or two chromosomes do exhibit it). Do you have any ideas what's causing this behavior?

Thanks!

Created 2013-01-16 15:05:06 | Updated 2013-01-17 16:23:28 | Tags: varianteval exome community

Hello,

I found the materials of the BroadE Workshop very helpful, especially the slides on analyzing variant calls using VariantEval, because there is not much documentation for it on the GATK site. As an example, 62 whole-genome sequencing samples from northern Europe were evaluated together with 1000G FIN samples, and the polymorphic and monomorphic sites on the 1000G genotype chip were used as comparators. I would very much like to do the same for our whole-exome data. The question is: is there good-quality whole-exome data that I can use to evaluate our own exome data?

I have checked the NHLBI ESP project's Exome Variant Server site, but the VCF files that can be downloaded don't include the genotype data.

Created 2013-01-08 13:47:41 | Updated 2013-01-09 03:06:09 | Tags: varianteval

Hello, I've just started using VariantEval. It seems very useful and provides a good deal of output. However, it's not obvious to me what all of the fields mean. Do you know when there will be documentation of the output?

For example, some things I'd like to know:

VariantSummary:

• What is the difference between TiTvRatio and TiTvRatioPerSample?
• What is SNPNoveltyRate? (In general I don't understand the rate fields)

TiTvVariantEvaluator:

• What is TiTvRatioStandard? What are the nTiDerived and nTvDerived fields?

CountVariants:

• What are variantRate and variantRatePerBp? Is variantRatePerBp the average number of bp between variants? (If anything it seems like the names for these should be reversed.)
• What are heterozygosity and heterozygosityPerBp?

What is the --dbsnp parameter used for? Is this redundant with --comp?

Jeremy

Created 2012-12-19 13:02:10 | Updated | Tags: varianteval multi-sample variant

Hi!

I want to know the best way to use VariantEval to get statistics for each sample in a multisample VCF file. If I call it like this:

    java -jar GenomeAnalysisTK.jar \
        -R ucsc.hg19.fasta \
        -T VariantEval \
        -o multisample.eval.gatkreport \
        --eval annotated.combined.vcf.gz \
        --dbsnp dbsnp_137.hg19.vcf

where annotated.combined.vcf.gz is a VCF file that contains ~1 million variants for ~800 samples, I get statistics for all samples combined, e.g.

    #:GATKReport.v1.1:8
    #:GATKTable:11:3:%s:%s:%s:%s:%s:%d:%d:%d:%.2f:%d:%.2f:;
    #:GATKTable:CompOverlap:The overlap between eval and comp sites
    CompOverlap  CompRod  EvalRod  JexlExpression  Novelty  nEvalVariants  ...
    CompOverlap  dbsnp    eval     none            all      471704  191147
    CompOverlap  dbsnp    eval     none            known    280557  0
    CompOverlap  dbsnp    eval     none            novel    191147  191147
But I would like to get one such entry per sample. Is there an easy way to do this?
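One workaround (a sketch, not an official recipe) is to split the multisample VCF per sample with SelectVariants' `-sn` option and run VariantEval on each single-sample VCF; the `-ST Sample` stratification shown elsewhere in these threads is another route. The snippet below only builds the command lines; the sample IDs are placeholders that would normally come from the VCF header.

```python
# Hedged sketch: generate per-sample SelectVariants + VariantEval command
# lines for a multisample VCF. Sample names here are placeholders.
samples = ["sample1", "sample2"]  # in practice, read from the VCF header

def make_commands(sample):
    """Return the two shell commands (as strings) for one sample."""
    select = (f"java -jar GenomeAnalysisTK.jar -T SelectVariants "
              f"-R ucsc.hg19.fasta -V annotated.combined.vcf.gz "
              f"-sn {sample} -o {sample}.vcf")
    evaluate = (f"java -jar GenomeAnalysisTK.jar -T VariantEval "
                f"-R ucsc.hg19.fasta --eval {sample}.vcf "
                f"--dbsnp dbsnp_137.hg19.vcf -o {sample}.eval.gatkreport")
    return [select, evaluate]

for s in samples:
    for cmd in make_commands(s):
        print(cmd)
```

With ~800 samples this produces 1,600 commands, which is clumsy but embarrassingly parallel; whether a single stratified VariantEval run is cheaper depends on the cluster setup.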

Thanks, Thomas

Created 2012-12-04 20:12:44 | Updated 2012-12-04 20:15:44 | Tags: varianteval jexl genotype

While running VariantEval, I'm trying to stratify by a JexlExpression using

-ST Sample -ST JexlExpression -select "GQ>20"

This fails with a "variable does not exist" error despite the GQ existing in all genotypes in the vcf. Looking at the code it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext) and thus, the context for the Jexl does not include GT and the error is produced.

My question is: Is this a feature or a bug? It seems possible to add the genotype when the VC only has one, or to loop over the genotypes and either OR or AND the results (perhaps via another input similar to -isr?), but perhaps I'm missing something subtle?

Would you like this behavior or are you happy with the current operation of jexlExpression?

Cheers!

Created 2012-11-22 21:09:04 | Updated | Tags: documentation selectvariants varianteval

Hello,

I've a couple of essentially documentation-centric questions...

Firstly, the SelectVariants documentation describes selecting 1000 random variants from a VCF using '-number 1000', however when I try that (with the command "java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta --variant variants.vcf -o 1000.vcf -number 1000") it produces the error 'Argument with the name number isn't defined'. Trying --number instead doesn't make any difference, while the --help output does not list this argument (GATK 2.2.2). Is this option no longer available?

Secondly, the 'gatkforums.broadinstitute.org/discussion/48/using-varianteval#Understanding_Genotype_Concordance_values_from_Variant_Eval' section of the 'Using VariantEval' page has a series of images explaining the concordance values, however the images are missing. Please could these be restored?

Many thanks, James

Created 2012-09-24 21:04:12 | Updated 2012-09-25 15:53:45 | Tags: varianteval genotypeconcordance

Hi,

I am using VariantEval --evalModule GenotypeConcordance in order to establish concordance and sensitivity metrics against a HapMap reference. In the resulting GATK report I obtain the following fields for a given SNP category (example with HETs):

GenotypeConcordance  CompRod  EvalRod  JexlExpression  Novelty  variable                                value
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HET                       6220
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HOM_REF                      0
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HOM_VAR                     20
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_MIXED                        0
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_NO_CALL                    318
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_UNAVAILABLE                  0

What is the meaning of the _MIXED and _UNAVAILABLE fields?

Thx, Gene

Created 2012-08-23 20:13:34 | Updated 2012-09-04 16:18:19 | Tags: varianteval

I'm on unstable v2.1-61-gde29ed5, Compiled 2012/08/23 16:03:06

I run VariantEval on the file below and get this:

java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

The file is this (after header)

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  A3N     A4N     A5N     A6N     A8N     A9N     H10N    H15N    H17N    H26N    H27N    H32N    H52N    H53N   H54N     H55N    H56N    H57N    H58N    H61N    H62N    H6N     H7N
1       901922  rs62639980      G       A,C     1332.23 PASS    AC=1,2;AF=6.849e-03,0.014;AN=146;BaseQRankSum=-3.705;DB;DP=2675;DS;Dels=0.00;FS=3.860;HaplotypeScore=1.6813;InbreedingCoeff=-0.0001;MLEAC=2,2;MLEAF=0.014,0.014;MQ=58.62;MQ0=1;MQRankSum=2.225;QD=7.84;ReadPosRankSum=3.761;SB=-5.136e+02;VQSLOD=3.9184;culprit=HaplotypeScore  GT:AD:DP:GQ:PL  0/0:1,0,0:59:99:0,178,2360,178,2360,2360        0/0:60,0,0:60:99:0,156,2055,156,2055,2055       0/0:60,0,0:60:99:0,160,2191,160,2191,2191       0/0:1,0,0:64:99:0,193,2516,193,2516,2516        0/0:1,0,0:47:99:0,141,1776,141,1776,1776        0/0:1,0,0:62:99:0,187,2480,187,2480,2480        0/0:59,1,0:60:99:0,119,1965,147,1968,1996       0/0:56,0,0:56:99:0,135,1857,135,1857,1857      0/0:1,0,0:57:99:0,172,2280,172,2280,2280 0/0:1,0,0:75:99:0,226,3000,226,3000,3000        0/0:8,0,0:71:99:0,214,2852,214,2852,2852        0/0:60,0,0:60:99:0,147,1942,147,1942,1942      0/0:47,0,0:47:99:0,123,1654,123,1654,1654        0/0:1,0,0:56:99:0,169,2240,169,2240,2240        0/0:1,0,0:55:99:0,166,2122,166,2122,2122        0/2:37,0,23:60:99:647,734,1844,0,1110,1050      0/0:1,0,0:71:99:0,214,2840,214,2840,2840        0/2:44,0,16:60:99:377,483,1801,0,1319,1277      0/0:57,0,0:57:99:0,141,1914,141,1914,1914       0/0:1,0,0:53:99:0,160,2154,160,2154,2154        0/0:1,0,0:74:99:0,223,3008,223,3008,3008        0/0:1,0,0:49:99:0,147,1960,147,1960,1960        0/0:1,0,0:76:99:0,229,2988,229,2988,2988

The file was created by first calling my ~20 samples with 50 random samples from 1kg and then subsetting to my samples using SelectVariants.

Created 2012-08-14 01:32:37 | Updated 2012-08-14 01:32:37 | Tags: varianteval

Hi, I'm running 1.6-512-gafa4399 (unstable).

When running VariantEval, I got an error: "Couldn't find state for 88 at node org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratNode"

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)
    -T VariantEval \
    -L /xchip/cga/reference/hg19/whole_exome_agilent_1.1_refseq_plus_3_boosters_plus_10bp_padding_minus_mito.Homo_sapiens_assembly19.targets.interval_list \
    -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
    -nt 1 \
    -o myFile.byAC.eval \
    -eval myFile.vcf \
    -D dbsnp_132_b37.leftAligned.vcf \
    -gold /humgen/gsa-hpprojects/GATK/bundle/current/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -ST EvalRod -ST CompRod -ST Novelty -ST FunctionalClass -ST AlleleCount -noST \
    -EV TiTvVariantEvaluator -EV CountVariants -EV CompOverlap -EV IndelSummary -noEV