1 documentation article | 0 announcements | 1 forum discussion

Created 2014-10-21 21:56:58 | Updated 2014-11-07 18:37:29 | Tags: statistics annotations

This document describes the statistical methods used by various GATK tools, especially the variant annotations. The methods are arranged alphabetically below.

Note that this document is in progress; we will be adding more over time. Please check back later or subscribe to forum notifications to receive an email when we update the content in this section. Feel free to leave a comment saying which stats methods used by GATK you'd like to see documented here.

- Rank Sum Test

The Rank Sum Test, also known as Mann-Whitney-Wilcoxon U-test after its developers (who are variously credited in subsets and in different orders depending on the sources you read) is a statistical test that aims to determine whether there is significant difference in the values of two populations of data.

The Wikipedia article about the Rank Sum Test includes some very helpful information on the theoretical underpinnings of the test, as well as various examples of how it can be applied.

This test is used by several GATK annotations, including two standard annotations that are used for variant recalibration in the Best Practices: MappingQualityRankSum and ReadPosRankSum. In all cases, the idea is to check, for a given candidate variant, whether the properties of the data that support the reference allele are similar to those of the data that support a variant allele. If they are not similar, we conclude that there may be some technical bias and that the candidate variant may be an artifact.

*Note: this example applies Method 2 from the Wikipedia article linked above.*

In this example, we have a set of 20 reads, 10 of which support the reference allele and 10 of which support the alternate allele. At first glance, that looks like a clear heterozygous 0/1 site. But to be thorough in our analysis and to account for any technical bias, we want to determine if there is a significant difference in the base qualities of the bases that support the reference allele vs. the bases that support the alternate allele.

Before we proceed, we must define our null hypothesis and alternate hypothesis.

-*Null hypothesis:* There is **no** difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

-*Alternate hypothesis:* There **is** a difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

Reference allele base qualities: 20, 25, 26, 30, 32, 40, 47, 50, 53, 60 Alternate allele base qualities: 0, 7, 10, 17, 20, 21, 30, 34, 40, 45

First, we arrange all the observations (base qualities) into a list of values ordered from lowest to highest (reference bases are in bold).

0, 7, 10, 17, **20**, 20, 21, **25**, **26**, **30**, 30, **32**, 34, **40**, 40, 45, **47**, **50**, **53**, **60**

Next we determine the ranks of the values. Since there are 20 observations (the base qualities), we have 20 ranks to assign. Whenever there are ties between observations for the rank, we take the rank to be equal to the midpoint of the ranks. For example, for 20(ref) and 20(alt), we have a tie in values, so we assign each observation a rank of (5+6)/2 = 5.5.

The ranks from the above list are (reference ranks are in bold):

1, 2, 3, 4, **5.5**, 5.5, 7, **8**, **9**, **10.5**, 10.5, **12**, 13, **14.5**, 14.5, 16, **17**, **18**, **19**, **20**

We now need to add up the ranks for the base qualities that came from the reference allele and the alternate allele.

$$ Rank_{ref} = 133.5 $$ $$ Rank_{alt} = 76.5 $$

U is a statistic that tells us the difference between the two rank totals. We can use the U statistic to calculate the z-score (explained below), which will give us our p-value.

Calculate U for each group (n = number of observations in each sample)

$$ U_{ref} = \frac{ n_{ref} * n_{alt} + n_{ref} * (n_{ref}+ 1) }{ 2 } - Rank_{ref} $$ $$ U_{alt} = \frac{ n_{alt} * n_{ref} + n_{alt} * (n_{alt} + 1) }{ 2 } - Rank_{alt} $$

$$ U_{ref} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 133.5 = 21.5 $$ $$ U_{alt} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 76.5 = 78.5 $$

Next, we need to calculate the z-score which will allow us to get the p-value. The z-score is a normalized score that allows us to compare the probability of the U score occurring in our distribution. https://statistics.laerd.com/statistical-guides/standard-score.php

The equation to get the z-score is:

$$ z = \frac{U - mu}{u} $$

Breaking this equation down:

$$ z = z-score $$ $$ U = \text{lowest of the U scores calculated in previous steps} $$

$$ mu = \text{mean of the U scores above} = \frac{ n_{ref} * n_{alt} }{ 2 } $$

$$ u = \text{standard deviation of U} = \sqrt{ \frac{n_{ref} * n_{alt} * (n_{ref} + n_{alt} + 1) }{ 12 } } $$

To calculate our z:

$$ U = 21.5 $$ $$ mu = \frac{10 * 10 }{ 2 } = 50 $$ $$ u = \sqrt{ \frac{10 * 10 *(10 + 10 + 1) }{ 12 } } = 13.229 $$

So altogether we have:

$$ z = \frac{ 21.5 - 50 }{ 13.229 } = -2.154 $$

The p-value is the probability of obtaining a z-score at least as extreme as the one we got, assuming the null hypothesis is true. In our example, the p-value gives us the probability that there is no difference in the base qualities that support the reference allele and the base qualities that support the alternate allele. The lower the p-value, the less likely it is that there is no difference in the base qualities.

Going to the z-score table, or just using a p-value calculator, we find the p-value to be 0.0312.

This means there is a .0312 chance that the base quality scores of the reference allele and alternate allele are the same. Assuming a p-value cutoff of 0.05, meaning there is less than 5% chance there is no difference in the two groups, and greater than or equal to 95% chance that there is a difference between the two groups, we have enough evidence to **reject our null hypothesis** that there is no difference in the base qualities of the reference and alternate allele. This indicates there is some bias and that the alternate allele is less well supported by the data than the allele counts suggest.

No posts found with the requested search criteria.

Created 2013-02-02 15:23:08 | Updated 2013-02-02 22:49:30 | Tags: unifiedgenotyper vqsr haplotypecaller validation statistics sensitivity specificity

Hi all, I've somewhere in this site that before VQSR the FP rate is expected to be around 10% (I guess for UnifiedGenotyper). Are there some updated statistics for VQRS? For HaplotypeCaller? For Exome/WG data? Another thing: we apply VQRS on all our analysis, we are trying to collect some validation statistics. We suspect that most of the FP have some particular "culprits" in VQRS (especially QD and MQ). Do you have some data about this? Best

d