# Tagged with #genotypeconcordance 2 documentation articles | 0 announcements | 14 forum discussions

Created 2015-10-21 15:23:10 | Updated 2015-12-01 23:12:30 | Tags: varianteval genotypeconcordance collectvariantcallingmetrics indel-ratio titv truthset

## Introduction

Running through the steps involved in variant discovery (calling variants, joint genotyping and applying filters) produces a variant callset in the form of a VCF file. So what’s next? Technically, that callset is ready to be used in downstream analysis. But before you do that, we recommend running some quality control analyses to evaluate how “good” that callset is.

To be frank, distinguishing between a “good” callset and a “bad” callset is a complex problem. If you knew the absolute truth of what variants are present or not in your samples, you probably wouldn’t be here running variant discovery on some high-throughput sequencing data. Your fresh new callset is your attempt to discover that truth. So how do you know how close you got?

### Methods for variant evaluation

There are several methods that you can apply which offer different insights into the probable biological truth, all with their own pros and cons. Possibly the most trusted method is Sanger sequencing of regions surrounding putative variants. However, it is also the least scalable as it would be prohibitively costly and time-consuming to apply to an entire callset. Typically, Sanger sequencing is only applied to validate candidate variants that are judged highly likely. Another popular method is to evaluate concordance against results obtained from a genotyping chip run on the same samples. This is much more scalable, and conveniently also doubles as a quality control method to detect sample swaps. Although it only covers the subset of known variants that the chip was designed for, this method can give you a pretty good indication of both sensitivity (ability to detect true variants) and specificity (not calling variants where there are none). This is something we do systematically for all samples in the Broad’s production pipelines.

The third method, presented here, is to evaluate how your variant callset stacks up against another variant callset (typically derived from other samples) that is considered to be a truth set (sometimes referred to as a gold standard -- these terms are very close and often used interchangeably). The general idea is that key properties of your callset (metrics discussed later in the text) should roughly match those of the truth set. This method is not meant to render any judgments about the veracity of individual variant calls; instead, it aims to estimate the overall quality of your callset and detect any red flags that might be indicative of error.

### Underlying assumptions and truthiness*: a note of caution

It should be immediately obvious that there are two important assumptions being made here: 1) that the content of the truth set has been validated somehow and is considered especially trustworthy; and 2) that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set. These assumptions are not always well-supported, depending on the truth set, your callset, and what they have (or don’t have) in common. You should always keep this in mind when choosing a truth set for your evaluation; it’s a jungle out there. Consider that if anyone can submit variants to a truth set’s database without a well-regulated validation process, and there is no process for removing variants if someone later finds they were wrong (I’m looking at you, dbSNP), you should be extra cautious in interpreting results. *With apologies to Stephen Colbert.

### Validation

So what constitutes validation? Well, the best validation is done with orthogonal methods, meaning that it is done with technology (wetware, hardware, software, etc.) that is not subject to the same error modes as the sequencing process. Calling variants with two callers that use similar algorithms? Great way to reinforce your biases. It won’t mean anything that both give the same results; they could both be making the same mistakes. On the wetlab side, Sanger and genotyping chips are great validation tools; the technology is pretty different, so they tend to make different mistakes. Therefore it means more if they agree or disagree with calls made from high-throughput sequencing.

### Matching populations

Regarding the population genomics aspect: it’s complicated -- especially if we’re talking about humans (I am). There’s a lot of interesting literature on this topic; for now let’s just summarize by saying that some important variant calling metrics vary depending on ethnicity. So if you are studying a population with a very specific ethnic composition, you should try to find a truth set composed of individuals with a similar ethnic background, and adjust your expectations accordingly for some metrics.

Similar principles apply to non-human genomic data, with important variations depending on whether you’re looking at wild or domesticated populations, natural or experimentally manipulated lineages, and so on. Unfortunately we can’t currently provide any detailed guidance on this topic, but hopefully this explanation of the logic and considerations involved will help you formulate a variant evaluation strategy that is appropriate for your organism of interest.

## Variant evaluation metrics

So let’s say you’ve got your fresh new callset and you’ve found an appropriate truth set. You’re ready to look at some metrics (but don’t worry yet about how; we’ll get to that soon enough). There are several metrics that we recommend examining in order to evaluate your data. The set described here should be considered a minimum and is by no means exclusive. It is nearly always better to evaluate more metrics if you possess the appropriate data to do so -- and as long as you understand why those additional metrics are meaningful. Please don’t try to use metrics that you don’t understand properly, because misunderstandings lead to confusion; confusion leads to worry; and worry leads to too many desperate posts on the GATK forum.

### Variant-level concordance and genotype concordance

The relationship between variant-level concordance and genotype concordance is illustrated in this figure.

• Genotype concordance is a similar metric but operates at the genotype level. It allows you to evaluate, within a set of variant calls that are present in both your sample callset and your truth set, what proportion of the genotype calls have been assigned correctly. This assumes that you are comparing your sample to a matched truth set derived from the same original sample.

### Number of Indels & SNPs and TiTv Ratio

These metrics are widely applicable. The table below summarizes their expected value ranges for Human Germline Data:

Sequencing Type # of Variants* TiTv Ratio
WGS ~4.4M 2.0-2.1
WES ~41k 3.0-3.3

*for a single sample

• Number of Indels & SNPs The number of variants detected in your sample(s) are counted separately as indels (insertions and deletions) and SNPs (Single Nucleotide Polymorphisms). Many factors can affect this statistic including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvement due to a software update. For reference, Nature's recently published 2015 paper in which various ethnicities in a moderately large cohort were analyzed for number of variants. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something went extremely wrong: e.g. 1000 variants in a large cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.

• TiTv Ratio This metric is the ratio of transition (Ti) to transversion (Tv) SNPs. If the distribution of transition and transversion mutations were random (i.e. without any biological influence) we would expect a ratio of 0.5. This is simply due to the fact that there are twice as many possible transversion mutations than there are transitions. However, in the biological context, it is very common to see a methylated cytosine undergo deamination to become thymine. As this is a transition mutation, it has been shown to increase the expected random ratio from 0.5 to ~2.01. Furthermore, CpG islands, usually found in primer regions, have higher concentrations of methylcytosines. By including these regions, whole exome sequencing shows an even stronger lean towards transition mutations, with an expected ratio of 3.0-3.3. A significant deviation from the expected values could indicate artifactual variants causing bias. If your TiTv Ratio is too low, your callset likely has more false positives.
It should also be noted that the TiTv ratio from exome-sequenced data will vary from the expected value based upon the length of flanking sequences. When we analyze exome sequence data, we add some padding (usually 100 bases) around the targeted regions (using the -ip engine argument) because this improves calling of variants that are at the edges of exons (whether inside the exon sequence or in the promoter/regulatory sequence before the exon). These flanking sequences are not subject to the same evolutionary pressures as the exons themselves, so the number of transition and transversion mutants lean away from the expected ratio. The amount of "lean" depends on how long the flanking sequence is.

### Ratio of Insertions to Deletions (Indel Ratio)

This metric is generally evaluated after filtering for purposes that are specific to your study, and the expected value range depends on whether you're looking for rare or common variants, as summarized in the table below.

Filtering for Indel Ratio
common ~1
rare 0.2-0.5

A significant deviation from the expected ratios listed in the table above could indicate a bias resulting from artifactual variants.

## Tools for performing variant evaluation

### VariantEval

This is the GATK’s main tool for variant evaluation. It is designed to collect and calculate a variety of callset metrics that are organized in evaluation modules, which are listed in the tool doc. For each evaluation module that is enabled, the tool will produce a table containing the corresponding callset metrics based on the specified inputs (your callset of interest and one or more truth sets). By default, VariantEval will run with a specific subset of the available modules (listed below), but all evaluation modules can be enabled or disabled from the command line. We recommend setting the tool to produce only the metrics that you are interested in, because each active module adds to the computational requirements and overall runtime of the tool.

It should be noted that all module calculations only include variants that passed filtering (i.e. FILTER column in your vcf file should read PASS); variants tagged as filtered out will be ignored. It is not possible to modify this behavior. See the [example analysis]() (document in progress) for more details on how to use this tool and interpret its output.

### GenotypeConcordance

This tool calculates -- you’ve guessed it -- the genotype concordance between callsets. In earlier versions of GATK, GenotypeConcordance was itself a module within VariantEval. It was converted into a standalone tool to enable more complex genotype concordance calculations.

### Picard tools

The Picard toolkit includes two tools that perform similar functions to VariantEval and GenotypeConcordance, respectively called CollectVariantCallingMetrics and GenotypeConcordance. Both are relatively lightweight in comparison to their GATK equivalents; their functionalities are more limited, but they do run quite a bit faster. See the [example analysis]() (document in progress) of CollectVariantCallingMetrics for details on its use and data interpretation. Note that in the coming months, the Picard tools are going to be integrated into the next major version of GATK, so at that occasion we plan to consolidate these two pairs of homologous tools to eliminate redundancy.

### Which tool should I use?

We recommend Picard's version of each tool for most cases. The GenotypeConcordance tools provide mostly the same information, but Picard's version is preferred by Broadies. Both VariantEval and CollectVariantCallingMetrics produce similar metrics, however the latter runs faster and is scales better for larger cohorts. By default, CollectVariantCallingMetrics stratifies by sample, allowing you to see the value of relevant statistics as they pertain to specific samples in your cohort. It includes all metrics discussed here, as well as a few more. On the other hand, VariantEval provides many more metrics beyond the minimum described here for analysis. It should be noted that none of these tools use phasing to determine metrics.

So when should I use CollectVariantCallingMetrics?

• If you have a very large callset
• If you want to look at the metrics discussed here and not much else
• If you want your analysis back quickly

When should I use VariantEval?

• When you require a more detailed analysis of your callset
• If you need to stratify your callset by another factor (allele frequency, indel size, etc.)
• If you need to compare to multiple truth sets at the same time

Created 2015-09-23 12:06:10 | Updated 2015-12-01 23:20:26 | Tags: varianteval genotypeconcordance metrics collectvariantcallingmetrics titv

## Context

This document will walk you through use of Picard's CollectVariantCallingMetrics tool, an excellent tool for large callsets, especially if you need your results quickly and do not require many additional metrics to those described here. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits to each tool, as well as the basis for variant evaluation, are discussed here.

## Example Use

### Command

java -jar picard.jar CollectVariantCallingMetrics \
INPUT=CEUtrio.vcf \
OUTPUT=CEUtrioMetrics \
DBSNP=dbsnp_138.b37.excluding_sites_after_129.vcf 
• INPUT The CEU trio (NA12892, NA12891, and 12878) from the 1000 Genomes Project is the input chosen for this example. It is the callset that we wish to examine the metrics on, and thus this is the field where you would specify the .vcf file containing your sample(s)'s variant calls.
• OUTPUT The output for this command will be written to two files named CEUtrioMetrics.variant_calling_summary_metrics and CEUtrioMetrics.variant_calling_detail_metrics, hereafter referred to as "summary" and "detail", respectively. The specification given in this field is applied as the name of the out files; the file extensions are provided by the tool itself.
• DBSNP The last required input to run this tool is a dbSNP file. The one used here is available in the current GATK bundle. CollectVariantCallingMetrics utilizes this dbSNP file as a base of comparison against the sample(s) present in your vcf.

### Getting Results

After running the command, CollectVariantCallingMetrics will return both a detail and a summary metrics file. These files can be viewed as a text file if needed, or they can be read in as a table using your preferred spreadsheet viewer (e.g. Excel) or scripting language of your choice (e.g. python, R, etc.) The files contain headers and are tab-delimited; the commands for reading in the tables in RStudio are found below. (Note: Replace "~/path/to/" with the path to your output files as needed.)

summary <- read.table("~/path/to/CEUtrioMetrics.variant_calling_summary_metrics", header=TRUE, sep="\t")
detail <- read.table("~/path/to/CEUtrioMetrics.variant_calling_detail_metrics", header=TRUE, sep="\t")
• Summary The summary metrics file will contain a single row of data for each metric, taking into account all samples present in your INPUT file.
• Detail The detail metrics file gives a breakdown of each statistic by sample. In addition to all metrics covered in the summary table, the detail table also contains entries for SAMPLE_ALIAS and HET_HOMVAR_RATIO. In the example case here, the detail file will contain metrics for the three different samples, NA12892, NA12891, and NA12878.

### Analyzing Results

*Concatenated in the above table are the detail file's (rows 1-3) and the summary file's (row 4) relevant metrics; for full output table, see attached image file.

• Number of Indels & SNPs This tool collects the number of SNPs (single nucleotide polymorphisms) and indels (insertions and deletions) as found in the variants file. It counts only biallelic sites and filters out multiallelic sites. Many factors affect these counts, including cohort size, relatedness between samples, strictness of filtering, ethnicity of samples, and even algorithm improvement due to updated software. While this metric alone is insufficient to evaluate your variants, it does provide a good baseline. It is reassuring to see that across the three related samples, we saw very similar numbers of SNPs and indels. It could be cause for concern if a particular sample had significantly more or fewer variants than the rest.
• Indel Ratio The indel ratio is determined to be the total number of insertions divided by the total number of deletions; this tool does not include filtered variants in this calculation. Usually, the indel ratio is around 1, as insertions occur typically as frequently as deletions. However, in rare variant studies, indel ratio should be around 0.2-0.5. Our samples have an indel ratio of ~0.95, indicating that these variants are not likely to have a bias affecting their insertion/deletion ratio.
• TiTv Ratio This metric is the ratio of transition (Ti) to transversion (Tv) mutations. For whole genome sequencing data, TiTv should be ~2.0-2.1, whereas whole exome sequencing data will have a TiTv ratio of ~3.0-3.31. In the case of the CEU trio of samples, the TiTv of ~2.08 and ~1.91 are within reason, and this variant callset is unlikely to have a bias affecting its transition/transversion ratio.
No posts found with the requested search criteria.

Created 2015-07-07 15:01:06 | Updated | Tags: genotypeconcordance

Since vcf files usually do not contain invariant sites, the Sensitivity, Discrepancy, and Concordence computed by GC are not correct actually. How to solve this problem?

Even if the two input vcf files do contain invariant sites, those three metrics should be computed ONLY in the intersected sites of the two vcf files; but currently they are computed in the union sites.

Thanks for any suggestion!

Created 2015-05-21 06:27:17 | Updated | Tags: haplotypecaller genotypeconcordance

Dear GATK Team,

I performed a variant caller comparison, and the genotype concordance analysis of HaplotypeCaller's results a little strange to me, and I can't understand it at all.

ALLELES_MATCH: 0 EVAL_SUPERSET_TRUTH: 2438 EVAL_SUBSET_TRUTH: 0 ALLELES_DO_NOT_MATCH: 636 EVAL_ONLY: 20 TRUTH_ONLY: 1014546

The last and first value could be true? Or anyone can tell what could cause this? I got very different (and reasonable) results with other callers.

Many thanks!

Created 2015-04-23 23:58:48 | Updated | Tags: jexl genotypeconcordance

We have VCFs that we've applied genotype-level filters to using FilterVariants, and we are interested in using your –gfc/gfe flags in order to allow all genotypes with an non-"PASS" FT annotation to be set to no call for a concordance check using GenotypeConcordance.

I have tried using the follow JEXL expression 'FT!="PASS"' for –gfc and –gfe but this has not been successful in setting the genotypes that FAIL to no call. Is there way to achieve this genotype-level filtration within GATK GenotypeConcordace? Is there a different JEXL expression that I should be using?

(Asking on behalf of @azo121 who isn't able to post yet)

Thanks!

Created 2014-12-18 08:52:46 | Updated | Tags: genotypeconcordance gatk

Hi, I didn't find the answer anywhere, so I'm asking here. I'm using GATK 2.6-4. I used GenotypeConcordance to compare two datasets, and it was great, exactly what I needed. But then, I was interesting in the discordant sites, and I saw that printInterestingSites should be a good solution. But I can't make it work. Just adding in my command line "-sites filename" doesn't work, and I've tried to be creative in changing this option. Then I thought, that maybe this option isn't working in my GATK version. Have I just done something wrong?

Created 2014-10-01 14:56:51 | Updated | Tags: genotypeconcordance

Could anyone help me with two questions in comparing my vcf file to gold standard?

1. Regarding sites present in my vcf but absent in gold standard, are they ignored?
2. Regarding sites absent in my vcf but present in gold standard, are they assumed 0/0?

Thanks,

Created 2014-09-17 16:45:28 | Updated | Tags: vcf genotypeconcordance

Hello, I just ran genotype concordance in order to determine how similar two samples were. However, in the output, everything is showing up as zero, the NRD determined is 1 and the overall genotype concordance is also 1. Has anyone encountered this before. The two samples I am using for --eval and --comp are actually the same sample that were sequenced using two different methodologies (HiSeq and Ion proton). So it is odd that I am getting this output. Any help is appreciated.

Thanks, Ricky

Created 2014-04-03 16:05:54 | Updated 2014-04-03 16:06:49 | Tags: genotypeconcordance

I'd like to check the accuracy of genotypes imputed from exome seq data of NA12891, as compared to those genotypes from GWAS chips. But the GWAS chips data I found from hapmap are all in b36, does anyone know where I can get those for b37?

If there is no chip data for b37 and since the liftover tool isn't working, I wonder how the SNPs are mapped between comp and eval in GenotypeConcordance? Are they mapped by SNP IDs or coordinates? If by SNP IDs, maybe it's OK to compare a vcf file in b37 to one in b36 since the SNP IDs are not changed?

Created 2014-03-27 14:50:15 | Updated | Tags: genotypeconcordance

Hi,

I hope this isn't a stupid question. I would like to compare genotypes between samples in a vcf. For example, in the vcf I have sample A, sample B and sample C and I would like to know the concordance between A and B and A and C. Is there a GATK tool that can do this? I have tried using GenotypeConcordance and VariantEval and supplying the multi-sample vcf as the file for evaluation (--eval) and a vcf containing only sample A (generated using SelectVariants) as the comparison (--comp). However, it doesn't produce the output I want. For example, GenotypeConcordance gives me concordance between ALL samples in the --eval file, rather than on a sample-by-sample basis.

Thanks

Kath

Created 2013-10-29 16:25:23 | Updated 2013-10-29 16:27:22 | Tags: genotypeconcordance

Just to make sure my understanding is correct:

HET: heterozygous
HOM_REF: homozygous reference
HOM_VAR: homozygous variant
MIXED: something like ./1
Mismatching_Alleles: ??
UNAVAILABLE: for internal use
ALLELES_MATCH: ??
ALLELES_DO_NOT_MATCH: ??
EVAL_ONLY: ??
TRUTH_ONLY: does it actually mean the variants present in comp but not in eval, like COMP_ONLY?

how does the following computed?

Non-Reference_Discrepancy
Non-Reference_Sensitivity
Overall_Genotype_Concordance

Thanks a lot!

Created 2013-06-10 21:09:29 | Updated | Tags: varianteval genotypeconcordance stratification

I would like to evaluate variant calls to produce a plot (psuedo-ROC) of sensitivity vs. specificity (or concordance, etc) when I condition on a minimum/maximum value for a particular metric (coverage, genotype quality, etc.). I can do this by running VariantEval or GenotypeConcordance multiple times, once for each cutoff value, but this is inefficient, since I believe I should be able to compute these values in one pass. Alternatively, if there was a simple tool to annotate each variant as concordance or discordant, I could tabulate the results myself. I would like to rely upon GATK's variant comparison logic to compare variants (especially indels). Any thoughts on if current tools can be parameterized, or adapted for these purposes?

N

Created 2012-11-06 19:59:43 | Updated 2012-11-06 20:03:47 | Tags: genotypeconcordance

Hi,

I have the following problem:

I am evaluating genotype concordance using:

-T VariantEval --evalModule GenotypeConcordance -comp ref.vcf -eval sample1.observed.vcf

If I use a reference genotype file with multiple samples in it where one of the genotype columns is NA1234 (the sample in question), then the sensitivity for all SNP types (HOM_REF,HET,HOM_VAR) decreases drastically. This is because the GATK gets confused when there is more than one sample in the reference file. I know this because if I use a reference genotype file (ref.vcf) with only a single hapmap sample (NA1234) everything works fine and sensitivity is good. So this is not a detection problem is a problem when SNPs are being compared against the reference.

I tried passing the sample name using the --sample parameter for -T VariantEval, but this does not work either (sensitivity is still way off).

In previous versions of the GATK this was done automatically where genotypes where compared based on the sample name within the detection vcf file (sample1.observed.vcf ) vs the ref.vcf file without having to specify the sample name explicitly.

How can I avoid this problem? I want to have a master reference genotype file with multiple samples that I can use for different samples.

I am using GATK version v1.6

Thank you, Gene

Created 2012-11-06 16:13:18 | Updated | Tags: selectvariants gatkdocs genotypeconcordance

I'm looking to find all the entries that change between two calls to UG on the same data. I would like to find all the entries where the call in the variant tract are different from those in the comparison track. So in effect I want those entries that would not be result from -using -conc in SelectVariants. From the documentation is is unclear if the -disc option does this:

A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track.

What if the comp is HOM_VAR and the variant track is HET? Or if they are both HET but disagree on the specific allele?

Thanks.

Created 2012-10-10 17:04:41 | Updated 2013-01-07 20:30:55 | Tags: phasebytransmission genotypeconcordance community snparray

Hi all,

I'd like to know if someone has tested the concordance from output of PhaseByTransmission with SNP array data.

I have calculated the genotype concordance for the most likely GT combination from the VCF obtained from unified genotyper for a family trio based on the GL values against SNP array data and then did the same for the genotypes obtained after using PhaseByTransmission and I'm seeing a drop in concordance.

Is this to be expected?

Thanks!

Created 2012-09-24 21:04:12 | Updated 2012-09-25 15:53:45 | Tags: varianteval genotypeconcordance

Hi,

I am using VariantEval --evalModule GenotypeConcordance in order to establish concordance and sensitivity metrics against a HapMap reference. In the resulting GATK report I obtain the following fields for a given SNP category (example with HETs):

GenotypeConcordance  CompRod  EvalRod  JexlExpression  Novelty  variable                                value---
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HET                       6220
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HOM_REF                      0
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HOM_VAR                     20
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_MIXED                        0
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_NO_CALL                    318
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_UNAVAILABLE                  0

What is the meaning of the _MIXED and _UNAVAILABLE fields?

Thx, Gene