Running through the steps involved in variant discovery (calling variants, joint genotyping and applying filters) produces a variant callset in the form of a VCF file. So what’s next? Technically, that callset is ready to be used in downstream analysis. But before you do that, we recommend running some quality control analyses to evaluate how “good” that callset is.
To be frank, distinguishing between a “good” callset and a “bad” callset is a complex problem. If you knew the absolute truth of what variants are present or not in your samples, you probably wouldn’t be here running variant discovery on some high-throughput sequencing data. Your fresh new callset is your attempt to discover that truth. So how do you know how close you got?
There are several methods that you can apply which offer different insights into the probable biological truth, all with their own pros and cons. Possibly the most trusted method is Sanger sequencing of regions surrounding putative variants. However, it is also the least scalable as it would be prohibitively costly and time-consuming to apply to an entire callset. Typically, Sanger sequencing is only applied to validate candidate variants that are judged highly likely. Another popular method is to evaluate concordance against results obtained from a genotyping chip run on the same samples. This is much more scalable, and conveniently also doubles as a quality control method to detect sample swaps. Although it only covers the subset of known variants that the chip was designed for, this method can give you a pretty good indication of both sensitivity (ability to detect true variants) and specificity (not calling variants where there are none). This is something we do systematically for all samples in the Broad’s production pipelines.
The third method, presented here, is to evaluate how your variant callset stacks up against another variant callset (typically derived from other samples) that is considered to be a truth set (sometimes referred to as a gold standard -- these terms are very close and often used interchangeably). The general idea is that key properties of your callset (metrics discussed later in the text) should roughly match those of the truth set. This method is not meant to render any judgments about the veracity of individual variant calls; instead, it aims to estimate the overall quality of your callset and detect any red flags that might be indicative of error.
It should be immediately obvious that there are two important assumptions being made here: 1) that the content of the truth set has been validated somehow and is considered especially trustworthy; and 2) that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set. These assumptions are not always well-supported, depending on the truth set, your callset, and what they have (or don’t have) in common. You should always keep this in mind when choosing a truth set for your evaluation; it’s a jungle out there. Consider that if anyone can submit variants to a truth set’s database without a well-regulated validation process, and there is no process for removing variants if someone later finds they were wrong (I’m looking at you, dbSNP), you should be extra cautious in interpreting results. *With apologies to Stephen Colbert.
So what constitutes validation? Well, the best validation is done with orthogonal methods, meaning that it is done with technology (wetware, hardware, software, etc.) that is not subject to the same error modes as the sequencing process. Calling variants with two callers that use similar algorithms? Great way to reinforce your biases. It won’t mean anything that both give the same results; they could both be making the same mistakes. On the wetlab side, Sanger and genotyping chips are great validation tools; the technology is pretty different, so they tend to make different mistakes. Therefore it means more if they agree or disagree with calls made from high-throughput sequencing.
Regarding the population genomics aspect: it’s complicated -- especially if we’re talking about humans (I am). There’s a lot of interesting literature on this topic; for now let’s just summarize by saying that some important variant calling metrics vary depending on ethnicity. So if you are studying a population with a very specific ethnic composition, you should try to find a truth set composed of individuals with a similar ethnic background, and adjust your expectations accordingly for some metrics.
Similar principles apply to non-human genomic data, with important variations depending on whether you’re looking at wild or domesticated populations, natural or experimentally manipulated lineages, and so on. Unfortunately we can’t currently provide any detailed guidance on this topic, but hopefully this explanation of the logic and considerations involved will help you formulate a variant evaluation strategy that is appropriate for your organism of interest.
So let’s say you’ve got your fresh new callset and you’ve found an appropriate truth set. You’re ready to look at some metrics (but don’t worry yet about how; we’ll get to that soon enough). There are several metrics that we recommend examining in order to evaluate your data. The set described here should be considered a minimum and is by no means exclusive. It is nearly always better to evaluate more metrics if you possess the appropriate data to do so -- and as long as you understand why those additional metrics are meaningful. Please don’t try to use metrics that you don’t understand properly, because misunderstandings lead to confusion; confusion leads to worry; and worry leads to too many desperate posts on the GATK forum.
The relationship between variant-level concordance and genotype concordance is illustrated in this figure.
Variant-level concordance (aka % Concordance) gives the percentage of variants in your samples that match (are concordant with) variants in your truth set. It essentially serves as a check of how well your analysis pipeline identified variants contained in the truth set. Depending on what you are evaluating and comparing, the interpretation of percent concordance can vary quite significantly. Comparing your sample(s) against genotyping chip results matched per sample allows you to evaluate whether you missed any real variants within the scope of what is represented on the chip. Based on that concordance result, you can extrapolate what proportion you may have missed out of the real variants not represented on the chip. If you don't have a sample-matched truth set and you're comparing your sample against a truth set derived from a population, your interpretation of percent concordance will be more limited. You have to account for the fact that some variants that are real in your sample will not be present in the population and that conversely, many variants that are in the population will not be present in your sample. In both cases, "how many" depends on how big the population is and how representative it is of your sample's background. Keep in mind that for most tools that calculate this metric, all unmatched variants (present in your sample but not in the truth set) are considered to be false positives. Depending on your trust in the truth set and whether or not you expect to see true, novel variants, these unmatched variants could warrant further investigation -- or they could be artifacts that you should ignore.
These metrics are widely applicable. The table below summarizes their expected value ranges for Human Germline Data:
|Sequencing Type||# of Variants*||TiTv Ratio|
*for a single sample
Number of Indels & SNPs The number of variants detected in your sample(s) are counted separately as indels (insertions and deletions) and SNPs (Single Nucleotide Polymorphisms). Many factors can affect this statistic including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvement due to a software update. For reference, Nature's recently published 2015 paper in which various ethnicities in a moderately large cohort were analyzed for number of variants. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something went extremely wrong: e.g. 1000 variants in a large cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.
-ipengine argument) because this improves calling of variants that are at the edges of exons (whether inside the exon sequence or in the promoter/regulatory sequence before the exon). These flanking sequences are not subject to the same evolutionary pressures as the exons themselves, so the number of transition and transversion mutants lean away from the expected ratio. The amount of "lean" depends on how long the flanking sequence is.
This metric is generally evaluated after filtering for purposes that are specific to your study, and the expected value range depends on whether you're looking for rare or common variants, as summarized in the table below.
|Filtering for||Indel Ratio|
A significant deviation from the expected ratios listed in the table above could indicate a bias resulting from artifactual variants.
This is the GATK’s main tool for variant evaluation. It is designed to collect and calculate a variety of callset metrics that are organized in evaluation modules, which are listed in the tool doc. For each evaluation module that is enabled, the tool will produce a table containing the corresponding callset metrics based on the specified inputs (your callset of interest and one or more truth sets). By default, VariantEval will run with a specific subset of the available modules (listed below), but all evaluation modules can be enabled or disabled from the command line. We recommend setting the tool to produce only the metrics that you are interested in, because each active module adds to the computational requirements and overall runtime of the tool.
It should be noted that all module calculations only include variants that passed filtering (i.e. FILTER column in your vcf file should read PASS); variants tagged as filtered out will be ignored. It is not possible to modify this behavior. See the [example analysis]() (document in progress) for more details on how to use this tool and interpret its output.
This tool calculates -- you’ve guessed it -- the genotype concordance between callsets. In earlier versions of GATK, GenotypeConcordance was itself a module within VariantEval. It was converted into a standalone tool to enable more complex genotype concordance calculations.
The Picard toolkit includes two tools that perform similar functions to VariantEval and GenotypeConcordance, respectively called CollectVariantCallingMetrics and GenotypeConcordance. Both are relatively lightweight in comparison to their GATK equivalents; their functionalities are more limited, but they do run quite a bit faster. See the [example analysis]() (document in progress) of CollectVariantCallingMetrics for details on its use and data interpretation. Note that in the coming months, the Picard tools are going to be integrated into the next major version of GATK, so at that occasion we plan to consolidate these two pairs of homologous tools to eliminate redundancy.
We recommend Picard's version of each tool for most cases. The GenotypeConcordance tools provide mostly the same information, but Picard's version is preferred by Broadies. Both VariantEval and CollectVariantCallingMetrics produce similar metrics, however the latter runs faster and is scales better for larger cohorts. By default, CollectVariantCallingMetrics stratifies by sample, allowing you to see the value of relevant statistics as they pertain to specific samples in your cohort. It includes all metrics discussed here, as well as a few more. On the other hand, VariantEval provides many more metrics beyond the minimum described here for analysis. It should be noted that none of these tools use phasing to determine metrics.
So when should I use CollectVariantCallingMetrics?
When should I use VariantEval?
This document will walk you through use of Picard's CollectVariantCallingMetrics tool, an excellent tool for large callsets, especially if you need your results quickly and do not require many additional metrics to those described here. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits to each tool, as well as the basis for variant evaluation, are discussed here.
java -jar picard.jar CollectVariantCallingMetrics \ INPUT=CEUtrio.vcf \ OUTPUT=CEUtrioMetrics \ DBSNP=dbsnp_138.b37.excluding_sites_after_129.vcf
After running the command, CollectVariantCallingMetrics will return both a detail and a summary metrics file. These files can be viewed as a text file if needed, or they can be read in as a table using your preferred spreadsheet viewer (e.g. Excel) or scripting language of your choice (e.g. python, R, etc.) The files contain headers and are tab-delimited; the commands for reading in the tables in RStudio are found below. (Note: Replace "~/path/to/" with the path to your output files as needed.)
summary <- read.table("~/path/to/CEUtrioMetrics.variant_calling_summary_metrics", header=TRUE, sep="\t") detail <- read.table("~/path/to/CEUtrioMetrics.variant_calling_detail_metrics", header=TRUE, sep="\t")
HET_HOMVAR_RATIO. In the example case here, the detail file will contain metrics for the three different samples, NA12892, NA12891, and NA12878.
*Concatenated in the above table are the detail file's (rows 1-3) and the summary file's (row 4) relevant metrics; for full output table, see attached image file.
Since vcf files usually do not contain invariant sites, the Sensitivity, Discrepancy, and Concordence computed by GC are not correct actually. How to solve this problem?
Even if the two input vcf files do contain invariant sites, those three metrics should be computed ONLY in the intersected sites of the two vcf files; but currently they are computed in the union sites.
Thanks for any suggestion!
Dear GATK Team,
I performed a variant caller comparison, and the genotype concordance analysis of HaplotypeCaller's results a little strange to me, and I can't understand it at all.
ALLELES_MATCH: 0 EVAL_SUPERSET_TRUTH: 2438 EVAL_SUBSET_TRUTH: 0 ALLELES_DO_NOT_MATCH: 636 EVAL_ONLY: 20 TRUTH_ONLY: 1014546
The last and first value could be true? Or anyone can tell what could cause this? I got very different (and reasonable) results with other callers.
We have VCFs that we've applied genotype-level filters to using FilterVariants, and we are interested in using your –gfc/gfe flags in order to allow all genotypes with an non-"PASS" FT annotation to be set to no call for a concordance check using GenotypeConcordance.
I have tried using the follow JEXL expression 'FT!="PASS"' for –gfc and –gfe but this has not been successful in setting the genotypes that FAIL to no call. Is there way to achieve this genotype-level filtration within GATK GenotypeConcordace? Is there a different JEXL expression that I should be using?
(Asking on behalf of @azo121 who isn't able to post yet)
Hi, I didn't find the answer anywhere, so I'm asking here. I'm using GATK 2.6-4. I used GenotypeConcordance to compare two datasets, and it was great, exactly what I needed. But then, I was interesting in the discordant sites, and I saw that printInterestingSites should be a good solution. But I can't make it work. Just adding in my command line "-sites filename" doesn't work, and I've tried to be creative in changing this option. Then I thought, that maybe this option isn't working in my GATK version. Have I just done something wrong?
Could anyone help me with two questions in comparing my vcf file to gold standard?
Hello, I just ran genotype concordance in order to determine how similar two samples were. However, in the output, everything is showing up as zero, the NRD determined is 1 and the overall genotype concordance is also 1. Has anyone encountered this before. The two samples I am using for --eval and --comp are actually the same sample that were sequenced using two different methodologies (HiSeq and Ion proton). So it is odd that I am getting this output. Any help is appreciated.
I'd like to check the accuracy of genotypes imputed from exome seq data of NA12891, as compared to those genotypes from GWAS chips. But the GWAS chips data I found from hapmap are all in b36, does anyone know where I can get those for b37?
If there is no chip data for b37 and since the liftover tool isn't working, I wonder how the SNPs are mapped between comp and eval in GenotypeConcordance? Are they mapped by SNP IDs or coordinates? If by SNP IDs, maybe it's OK to compare a vcf file in b37 to one in b36 since the SNP IDs are not changed?
I hope this isn't a stupid question. I would like to compare genotypes between samples in a vcf. For example, in the vcf I have sample A, sample B and sample C and I would like to know the concordance between A and B and A and C. Is there a GATK tool that can do this? I have tried using GenotypeConcordance and VariantEval and supplying the multi-sample vcf as the file for evaluation (--eval) and a vcf containing only sample A (generated using SelectVariants) as the comparison (--comp). However, it doesn't produce the output I want. For example, GenotypeConcordance gives me concordance between ALL samples in the --eval file, rather than on a sample-by-sample basis.
Just to make sure my understanding is correct:
HET: heterozygous HOM_REF: homozygous reference HOM_VAR: homozygous variant MIXED: something like `./1` Mismatching_Alleles: ?? UNAVAILABLE: for internal use ALLELES_MATCH: ?? ALLELES_DO_NOT_MATCH: ?? EVAL_ONLY: ?? TRUTH_ONLY: does it actually mean the variants present in comp but not in eval, like COMP_ONLY?
how does the following computed?
Non-Reference_Discrepancy Non-Reference_Sensitivity Overall_Genotype_Concordance
Thanks a lot!
I would like to evaluate variant calls to produce a plot (psuedo-ROC) of sensitivity vs. specificity (or concordance, etc) when I condition on a minimum/maximum value for a particular metric (coverage, genotype quality, etc.). I can do this by running VariantEval or GenotypeConcordance multiple times, once for each cutoff value, but this is inefficient, since I believe I should be able to compute these values in one pass. Alternatively, if there was a simple tool to annotate each variant as concordance or discordant, I could tabulate the results myself. I would like to rely upon GATK's variant comparison logic to compare variants (especially indels). Any thoughts on if current tools can be parameterized, or adapted for these purposes?
Thanks for your help in advance,
I have the following problem:
I am evaluating genotype concordance using:
-T VariantEval --evalModule GenotypeConcordance -comp ref.vcf -eval sample1.observed.vcf
If I use a reference genotype file with multiple samples in it where one of the genotype columns is NA1234 (the sample in question), then the sensitivity for all SNP types (HOM_REF,HET,HOM_VAR) decreases drastically. This is because the GATK gets confused when there is more than one sample in the reference file. I know this because if I use a reference genotype file (ref.vcf) with only a single hapmap sample (NA1234) everything works fine and sensitivity is good. So this is not a detection problem is a problem when SNPs are being compared against the reference.
I tried passing the sample name using the --sample parameter for -T VariantEval, but this does not work either (sensitivity is still way off).
In previous versions of the GATK this was done automatically where genotypes where compared based on the sample name within the detection vcf file (sample1.observed.vcf ) vs the ref.vcf file without having to specify the sample name explicitly.
How can I avoid this problem? I want to have a master reference genotype file with multiple samples that I can use for different samples.
I am using GATK version v1.6
Thank you, Gene
I'm looking to find all the entries that change between two calls to UG on the same data. I would like to find all the entries where the call in the variant tract are different from those in the comparison track. So in effect I want those entries that would not be result from -using -conc in SelectVariants. From the documentation is is unclear if the -disc option does this:
A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track.
What if the comp is HOM_VAR and the variant track is HET? Or if they are both HET but disagree on the specific allele?
I'd like to know if someone has tested the concordance from output of PhaseByTransmission with SNP array data.
I have calculated the genotype concordance for the most likely GT combination from the VCF obtained from unified genotyper for a family trio based on the GL values against SNP array data and then did the same for the genotypes obtained after using PhaseByTransmission and I'm seeing a drop in concordance.
Is this to be expected?
I am using VariantEval --evalModule GenotypeConcordance in order to establish concordance and sensitivity metrics against a HapMap reference. In the resulting GATK report I obtain the following fields for a given SNP category (example with HETs):
GenotypeConcordance CompRod EvalRod JexlExpression Novelty variable value--- GenotypeConcordance comp eval none all n_true_HET_called_HET 6220 GenotypeConcordance comp eval none all n_true_HET_called_HOM_REF 0 GenotypeConcordance comp eval none all n_true_HET_called_HOM_VAR 20 GenotypeConcordance comp eval none all n_true_HET_called_MIXED 0 GenotypeConcordance comp eval none all n_true_HET_called_NO_CALL 318 GenotypeConcordance comp eval none all n_true_HET_called_UNAVAILABLE 0
What is the meaning of the _MIXED and _UNAVAILABLE fields?