GenotypeConcordance

GenotypeConcordance is an evaluation module for VariantEval. It comprises two subtables (a detailed table and a simplified table) that report per-sample and overall genotype concordance metrics.

Introduction

GenotypeConcordance computes two different tables of metrics related to genotype concordance of an evaluation callset to a comparison callset. These tables are:

Table Description
detailedStats the concordance statistics for each sample
simplifiedStats the concordance statistics summary for each sample

Understanding the output of each subtable

GenotypeConcordance.detailedStats

Given an evaluation track and a comparison track, this table counts the homozygous-reference, heterozygous, and homozygous-variant sites per sample in the comparison track. For each of those classes, it then compares the genotype called in the evaluation track with the genotype given in the comparison track and emits the full "confusion matrix": how many genotypes are consistent with the comparison data and, for those that differ, which genotype was called instead.

This table has the following columns:

Column Description
row the sample name
total_true_ref the total number of true homozygous-reference sites in the sample (as indicated in the comparison track)
pct_ref_vs_ref the percentage of hom-ref sites in the comparison track that are called hom-ref in the evaluation track
n_ref_vs_no_call the number of hom-ref sites in the comparison track that are no-calls in the evaluation track
n_ref_vs_ref the number of hom-ref sites in the comparison track that are called hom-ref in the evaluation track
n_ref_vs_het the number of hom-ref sites in the comparison track that are called heterozygous in the evaluation track
n_ref_vs_hom the number of hom-ref sites in the comparison track that are called hom-var in the evaluation track
total_true_het the total number of true heterozygous sites in the sample (as indicated in the comparison track)
pct_het_vs_het the percentage of het sites in the comparison track that are called het in the evaluation track
n_het_vs_no_call the number of het sites in the comparison track that are no-calls in the evaluation track
n_het_vs_ref the number of het sites in the comparison track that are called hom-ref in the evaluation track
n_het_vs_het the number of het sites in the comparison track that are called het in the evaluation track
n_het_vs_hom the number of het sites in the comparison track that are called hom-var in the evaluation track
total_true_hom the total number of true homozygous-variant sites in the sample (as indicated in the comparison track)
pct_hom_vs_hom the percentage of hom-var sites in the comparison track that are called hom-var in the evaluation track
n_hom_vs_no_call the number of hom-var sites in the comparison track that are no-calls in the evaluation track
n_hom_vs_ref the number of hom-var sites in the comparison track that are called hom-ref in the evaluation track
n_hom_vs_het the number of hom-var sites in the comparison track that are called het in the evaluation track
n_hom_vs_hom the number of hom-var sites in the comparison track that are called hom-var in the evaluation track
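
As an illustration of how the columns above relate to one another, the following Python sketch tallies a confusion matrix for one sample and derives the counts and percentages from it. It is not the VariantEval implementation. The function name `detailed_stats`, the genotype labels, and the input format (a list of comparison/evaluation genotype pairs) are hypothetical, and the sketch assumes that each `total_true_*` value includes evaluation no-calls and serves as the denominator of the matching `pct_*` column.

```python
from collections import Counter

CLASSES = ("ref", "het", "hom")            # genotype classes in the comparison track
CALLS = ("no_call", "ref", "het", "hom")   # possible calls in the evaluation track


def detailed_stats(pairs):
    """Build one detailedStats-style row from (comp_genotype, eval_genotype) pairs.

    Assumption: total_true_<class> counts every comparison site of that class,
    including sites that are no-calls in the evaluation track.
    """
    counts = Counter(pairs)  # the confusion matrix, keyed by (comp, eval)
    row = {}
    for comp in CLASSES:
        total = sum(counts[(comp, ev)] for ev in CALLS)
        row[f"total_true_{comp}"] = total
        for ev in CALLS:
            row[f"n_{comp}_vs_{ev}"] = counts[(comp, ev)]
        # Percentage of comparison sites of this class called identically.
        row[f"pct_{comp}_vs_{comp}"] = (
            100.0 * counts[(comp, comp)] / total if total else 0.0
        )
    return row


# Hypothetical genotype pairs for a single sample.
example_pairs = (
    [("ref", "ref")] * 3
    + [("ref", "het")]
    + [("het", "het")] * 2
    + [("het", "no_call")]
    + [("hom", "hom")]
)
example_row = detailed_stats(example_pairs)
```

Each `n_*` column is a single cell of the matrix, each `total_true_*` column is a row sum, and each `pct_*` column is the diagonal cell divided by that row sum.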

GenotypeConcordance.simplifiedStats

This table is a simplified version of the one above, containing only the percentage of genotypes of each class called correctly, and the non-reference sensitivity, overall genotype concordance, and non-reference discrepancy rate metrics commonly used in evaluating a variant callset against comparison data.

Column Description
row the sample name
percent_comp_ref_called_ref the percentage of hom-ref sites in the comparison track that are called hom-ref in the evaluation track
percent_comp_het_called_het the percentage of het sites in the comparison track that are called het in the evaluation track
percent_comp_hom_called_hom the percentage of hom-var sites in the comparison track that are called hom-var in the evaluation track
percent_non_reference_sensitivity the non-reference sensitivity: the fraction of sites called variant (A/B or B/B) in the comparison track that are also called variant in the evaluation track
percent_overall_genotype_concordance the overall genotype concordance: the accuracy of genotype calls at all loci, excluding no-calls in either set. This metric is often biased towards A/A loci and is not recommended for routine analysis.
percent_non_reference_discrepancy_rate the non-reference discrepancy rate: the accuracy of genotype calls at sites called by both sets, excluding concordant A/A genotypes (these are often large in number and easier to get correct). This is a good metric for assaying the accuracy of genotype calls.
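
The three summary metrics can be sketched in Python from the same kind of confusion matrix. This is a minimal illustration under stated assumptions, not the VariantEval code: the function name `simplified_metrics`, the dictionary layout, and the counts are hypothetical. In particular, the sketch assumes that the non-reference sensitivity denominator includes comparison variant sites that are no-calls in the evaluation track, and that the non-reference discrepancy rate is the fraction of discordant genotypes among sites called by both sets once concordant A/A sites are removed.

```python
def simplified_metrics(matrix):
    """Compute NRS, OGC and NRD (as percentages) from a confusion matrix.

    matrix maps (comp_genotype, eval_genotype) -> count, with genotypes drawn
    from "ref" (A/A), "het" (A/B), "hom" (B/B) and "no_call".
    """
    variant = ("het", "hom")
    called = ("ref", "het", "hom")

    def n(comp, ev):
        return matrix.get((comp, ev), 0)

    def pct(num, den):
        return 100.0 * num / den if den else float("nan")

    # NRS: comparison variant sites that are also variant in evaluation.
    # Assumption: evaluation no-calls stay in the denominator.
    comp_variant = sum(n(c, e) for c in variant for e in called + ("no_call",))
    both_variant = sum(n(c, e) for c in variant for e in variant)

    # OGC: identical genotypes among sites called in both sets.
    both_called = sum(n(c, e) for c in called for e in called)
    concordant = sum(n(g, g) for g in called)

    # NRD: discordant genotypes among sites called in both sets,
    # after dropping concordant A/A sites.
    nrd_den = both_called - n("ref", "ref")
    nrd_num = nrd_den - n("het", "het") - n("hom", "hom")

    return {
        "percent_non_reference_sensitivity": pct(both_variant, comp_variant),
        "percent_overall_genotype_concordance": pct(concordant, both_called),
        "percent_non_reference_discrepancy_rate": pct(nrd_num, nrd_den),
    }


# Hypothetical counts for one sample.
example_matrix = {
    ("ref", "ref"): 90, ("ref", "het"): 2,
    ("het", "ref"): 5, ("het", "het"): 40, ("het", "hom"): 3, ("het", "no_call"): 2,
    ("hom", "het"): 1, ("hom", "hom"): 20, ("hom", "no_call"): 1,
}
example_metrics = simplified_metrics(example_matrix)
```

With these counts the 90 concordant A/A sites dominate the overall genotype concordance, while the discrepancy rate is computed over only the remaining 71 sites, which is why the latter is the more informative accuracy measure.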

GenotypeConcordance.simplifiedStats example

The table below gives an example of what the non-reference sensitivity (NRS), non-reference discrepancy rate (NRD), and overall genotype concordance (OGC) would be given the number of genotypes of each class present in the evaluation set and the comparison set.

Example calculation