1 documentation article | 0 announcements | 4 forum discussions

Created 2014-10-21 21:56:58 | Updated 2015-08-31 19:54:21 | Tags: statistics annotations

This document describes the statistical methods used by various GATK tools, especially the variant annotations. The methods are arranged alphabetically below.

Note that this document is in progress; we will be adding more over time. Please check back later or subscribe to forum notifications to receive an email when we update the content in this section. Feel free to leave a comment saying which stats methods used by GATK you'd like to see documented here.

- Inbreeding Coefficient
- Rank Sum Test

Although the name Inbreeding Coefficient suggests it is a measure of inbreeding, Inbreeding Coefficient measures the excess heterozygosity at a variant site. It can be used as a proxy for poor mapping (sites that have high Inbreeding Coefficients are typically locations in the genome where the mapping is bad and reads that are in the region mismatch the region because they belong elsewhere). At least 10 samples are required (preferably many more) in order for this annotation to be calculated properly.

The Wikipedia article about Hardy-Weinberg principle includes some very helpful information on the theoretical underpinnings of the test, as Inbreeding Coefficient relies on the math behind the Hardy-Weinberg Principle.

We calculate Inbreeding Coefficient as 1-(# observed heterozygotes)/(# expected heterozygotes). The number of observed heterozygotes can be calculated from the data. The number of expected heterozygotes is 2pq, where p is the frequency of the reference allele and q is the frequency of the alternate allele (AF). (Please see Hardy-Weinberg Principle link above). A value of 0 suggests the site is in Hardy-Weinberg Equilibrium. Negative values of Inbreeding Coefficient could mean there are too many heterozygotes and suggest a site with bad mapping. The other nice side effect is that one of the error modes in variant calling is for all calls to be heterozygous, which this metric captures nicely. This is why we recommend filtering out variants with negative Inbreeding Coefficients. Although positive values suggest too few heterozygotes, we do not recommend filtering out positive values because they could arise from admixture of different ethnic populations.

In this example, lets say we are working with 100 human samples, and we are trying to calculate Inbreeding Coefficient at a site that has A for the reference allele and T for the alternate allele.

A/A (hom-ref): 51 A/T (het): 11 T/T (hom-var): 38

We need to find the # observed hets and # expected hets.

number of observed hets = 11 (from number of observed A/T given above)

number of expected hets = 2pq * total genotypes (2pq is frequency of heterozygotes according to Hardy-Weinberg Equilibrium. We need to multiply that frequency by the number of all genotypes in the population to get the expected number of heterozygotes.)

p = frequency of ref allele = (# ref alleles)/(total # alleles) = (2 * 51 + 11)/(2 * 51 + 11 * 2 + 38 * 2) = 113/200 = 0.565 q = frequency of alt allele = (# alt alleles)/(total # alleles) = (2 * 38 + 11)/(2 * 51 + 11 * 2 + 38 * 2) = 87/200 = 0.435

number of expected hets = 2pq * 100 = 2 * 0.565 * 0.435 * 100 = 49.155

Inbreeding Coefficient = 1 - (# observed hets)/(#expected hets) = 1 - (11/49.155) = 0.776

Our Inbreeding Coefficient is 0.776. Because it is a positive number, we can see there are fewer than the expected number of heterozygotes according to the Hardy-Weinberg Principle. Too few heterozygotes can imply inbreeding. However, we do not recommend filtering this site out because there may be a mixture of ethnicities in the cohort, and some ethnicities may be hom-ref while others are hom-var.

The Rank Sum Test, also known as Mann-Whitney-Wilcoxon U-test after its developers (who are variously credited in subsets and in different orders depending on the sources you read) is a statistical test that aims to determine whether there is significant difference in the values of two populations of data.

The Wikipedia article about the Rank Sum Test includes some very helpful information on the theoretical underpinnings of the test, as well as various examples of how it can be applied.

This test is used by several GATK annotations, including two standard annotations that are used for variant recalibration in the Best Practices: MappingQualityRankSum and ReadPosRankSum. In all cases, the idea is to check, for a given candidate variant, whether the properties of the data that support the reference allele are similar to those of the data that support a variant allele. If they are not similar, we conclude that there may be some technical bias and that the candidate variant may be an artifact.

*Note: this example applies Method 2 from the Wikipedia article linked above.*

In this example, we have a set of 20 reads, 10 of which support the reference allele and 10 of which support the alternate allele. At first glance, that looks like a clear heterozygous 0/1 site. But to be thorough in our analysis and to account for any technical bias, we want to determine if there is a significant difference in the base qualities of the bases that support the reference allele vs. the bases that support the alternate allele.

Before we proceed, we must define our null hypothesis and alternate hypothesis.

-*Null hypothesis:* There is **no** difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

-*Alternate hypothesis:* There **is** a difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

Reference allele base qualities: 20, 25, 26, 30, 32, 40, 47, 50, 53, 60 Alternate allele base qualities: 0, 7, 10, 17, 20, 21, 30, 34, 40, 45

First, we arrange all the observations (base qualities) into a list of values ordered from lowest to highest (reference bases are in bold).

0, 7, 10, 17, **20**, 20, 21, **25**, **26**, **30**, 30, **32**, 34, **40**, 40, 45, **47**, **50**, **53**, **60**

Next we determine the ranks of the values. Since there are 20 observations (the base qualities), we have 20 ranks to assign. Whenever there are ties between observations for the rank, we take the rank to be equal to the midpoint of the ranks. For example, for 20(ref) and 20(alt), we have a tie in values, so we assign each observation a rank of (5+6)/2 = 5.5.

The ranks from the above list are (reference ranks are in bold):

1, 2, 3, 4, **5.5**, 5.5, 7, **8**, **9**, **10.5**, 10.5, **12**, 13, **14.5**, 14.5, 16, **17**, **18**, **19**, **20**

We now need to add up the ranks for the base qualities that came from the reference allele and the alternate allele.

$$ Rank_{ref} = 133.5 $$ $$ Rank_{alt} = 76.5 $$

U is a statistic that tells us the difference between the two rank totals. We can use the U statistic to calculate the z-score (explained below), which will give us our p-value.

Calculate U for each group (n = number of observations in each sample)

$$ U_{ref} = \frac{ n_{ref} * n_{alt} + n_{ref} * (n_{ref}+ 1) }{ 2 } - Rank_{ref} $$ $$ U_{alt} = \frac{ n_{alt} * n_{ref} + n_{alt} * (n_{alt} + 1) }{ 2 } - Rank_{alt} $$

$$ U_{ref} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 133.5 = 21.5 $$ $$ U_{alt} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 76.5 = 78.5 $$

Next, we need to calculate the z-score which will allow us to get the p-value. The z-score is a normalized score that allows us to compare the probability of the U score occurring in our distribution. https://statistics.laerd.com/statistical-guides/standard-score.php

The equation to get the z-score is:

$$ z = \frac{U - mu}{u} $$

Breaking this equation down:

$$ z = z-score $$ $$ U = \text{lowest of the U scores calculated in previous steps} $$

$$ mu = \text{mean of the U scores above} = \frac{ n_{ref} * n_{alt} }{ 2 } $$

$$ u = \text{standard deviation of U} = \sqrt{ \frac{n_{ref} * n_{alt} * (n_{ref} + n_{alt} + 1) }{ 12 } } $$

To calculate our z:

$$ U = 21.5 $$ $$ mu = \frac{10 * 10 }{ 2 } = 50 $$ $$ u = \sqrt{ \frac{10 * 10 *(10 + 10 + 1) }{ 12 } } = 13.229 $$

So altogether we have:

$$ z = \frac{ 21.5 - 50 }{ 13.229 } = -2.154 $$

The p-value is the probability of obtaining a z-score at least as extreme as the one we got, assuming the null hypothesis is true. In our example, the p-value gives us the probability that there is no difference in the base qualities that support the reference allele and the base qualities that support the alternate allele. The lower the p-value, the less likely it is that there is no difference in the base qualities.

Going to the z-score table, or just using a p-value calculator, we find the p-value to be 0.0312.

This means there is a .0312 chance that the base quality scores of the reference allele and alternate allele are the same. Assuming a p-value cutoff of 0.05, meaning there is less than 5% chance there is no difference in the two groups, and greater than or equal to 95% chance that there is a difference between the two groups, we have enough evidence to **reject our null hypothesis** that there is no difference in the base qualities of the reference and alternate allele. This indicates there is some bias and that the alternate allele is less well supported by the data than the allele counts suggest.

No posts found with the requested search criteria.

Created 2015-02-18 05:48:33 | Updated 2015-02-18 05:49:23 | Tags: variantrecalibrator annotations variant-recalibration

Hi, I am working on variant calling for a genotype of rice and I just generated a GVCF file. Here is a snippet of the GVCF file.

Chr1 173923 . C T,

For the VQSR step, I only have the dbSNP file for rice (there are no hapmap or 1000G files) which has following annotation: CHROM POS ID REF ALT QUAL FILTER INFO Chr1 84 rs349504122 T C . . RSPOS=84;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100 Chr1 2223 rs349642167 C T . . RSPOS=2223;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2228 rs350936653 A C . . RSPOS=2228;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2417 rs347865191 G A . . RSPOS=2417;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2546 rs349956414 A C . . RSPOS=2546;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2588 rs352913853 A C . . RSPOS=2588;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2634 rs348515762 C A . . RSPOS=2634;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100

Now, if I use -an QD for VQSR, it wont work because QD is not present in the input vcf file. So, given the two files above, what annotation can I use for VQSR step? Can I use the VQSR step at all?

And this is the setting in which I am using the dbSNP: -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 Are the settings of known/training/truth correct if dbSNP is the only resource?

Thanks in advance, you have been very helpful NB

We're trying to run HaplotypeCaller with some additional annotations. For reference, here's the command we're using:

java -Xmx16g -jar GenomeAnalysisTK.jar -R GRCh37.fa -T HaplotypeCaller -A ReadPosRankSumTest -A RMSMappingQuality -A ClippingRankSumTest -A HaplotypeScore -A MappingQualityRankSumTest -o calls.vcf -I input.bam -nct 24

We're seeing ReadPosRanksum, Clippingranksum, and MQRanksum generated with our calls, but we're not seeing HaplotypeScore. Is HaplotypeScore only generated for certain variants? Does it depend on some other options that we're not using?

Thanks!

Hi,

I known that this question should not post to the GATK forum because the ERROR told me that "Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself." However, I really cant find which step I made this error although I have read many documentations of GATK forum about this error. Could you please give me some suggestions? Much thanks!!

In my VCF file, I find that not all the SNP terms have the same set of annotation, and some annotations cant be found in some SNP terms, like this:

```
chr1 5036777 rs898335 T G 36.74 . AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLE AF=1.00;MQ=37.00;MQ0=0;QD=18.37 GT:AD:DP:GQ:PL 1/1:0,2:2:6:64,6,0
chr1 9507566 rs12742542 T C 33.74 . AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=16.87 GT:AD:DP:GQ:PL 1/1:0,2:2:6:61,6,0
chr1 9507621 rs12755964 G A 37.74 . AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=18.87 GT:AD:DP:GQ:PL 1/1:0,2:2:6:65,6,0
chr1 22376947 rs2473327 A G 40.74 . AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=20.37 GT:AD:DP:GQ:PL 1/1:0,2:2:6:68,6,0
chr1 38061706 rs10908362 G C 32.74 . AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=16.37 GT:AD:DP:GQ:PL 1/1:0,2:2:6:60,6,0
chr1 78317717 rs10782656 G C 36.74 . AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=18.37 GT:AD:DP:GQ:PL 1/1:0,2:2:6:64,6,0
chr1 111457142 rs1282019 A G 35.74 . AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=30.81;MQ0=0;QD=17.87 GT:AD:DP:GQ:PL 1/1:0,2:2:6:63,6,0
chr1 121484153 rs9701684 C G 32.74 . AC=2;AF=1.00;AN=2;DB;DP=5;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=19.48;MQ0=3;QD=6.55 GT:AD:DP:GQ:PL 1/1:2,3:5:6:60,6,0
chr1 121484423 rs7368003 T C 83.03 . AC=2;AF=1.00;AN=2;DB;DP=5;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=28.24;MQ0=1;QD=16.61 GT:AD:DP:GQ:PL 1/1:0,5:5:12:111,12,0
chr1 121484503 rs4898086 T A 311.77 . AC=2;AF=1.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=33.63;MQ0=1;QD=25.98 GT:AD:DP:GQ:PL 1/1:0,12:12:33:340,33,0
chr1 121484591 rs4898109 T A 463.77 . AC=2;AF=1.00;AN=2;DB;DP=20;Dels=0.00;FS=0.000;HaplotypeScore=0.9947;MLEAC=2;MLEAF=1.00;MQ=30.44;MQ0=1;QD=23.19 GT:AD:DP:GQ:PL 1/1:0,20:20:54:492,54,0
chr1 121484599 rs4898111 C G 109.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.555;DB;DP=21;Dels=0.00;FS=0.000;HaplotypeScore=4.9775;MLEAC=1;MLEAF=0.500;MQ=30.78;MQ0=1;MQRankSum=-1.030;QD=5.23;ReadPosRankSum=1.664 GT:AD:DP:GQ:PL 0/1:14,7:21:99:138,0,312
chr1 121484600 rs1825267 T C 100.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.347;DB;DP=21;Dels=0.00;FS=0.000;HaplotypeScore=4.9775;MLEAC=1;MLEAF=0.500;MQ=30.78;MQ0=1;MQRankSum=-1.743;QD=4.80;ReadPosRankSum=1.585 GT:AD:DP:GQ:PL 0/1:14,7:21:99:129,0,310
chr1 121484602 rs74187930 T C 190.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.323;DB;DP=22;Dels=0.00;FS=0.000;HaplotypeScore=4.9775;MLEAC=1;MLEAF=0.500;MQ=31.09;MQ0=1;MQRankSum=1.323;QD=8.67;ReadPosRankSum=-2.306 GT:AD:DP:GQ:PL 0/1:10,11:22:99:219,0,223
chr1 121484650 rs4092774 A G 58.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.620;DB;DP=15;Dels=0.00;FS=0.000;HaplotypeScore=0.9989;MLEAC=1;MLEAF=0.500;MQ=30.73;MQ0=2;MQRankSum=-0.540;QD=3.92;ReadPosRankSum=-0.231 GT:AD:DP:GQ:PL 0/1:10,4:15:87:87,0,222
```

e.g. MQRankSum can be found only in the last 4 terms.

and my Command Line:

```
java -Xmx15g -jar /ifs1/ST_POP/USER/lantianming/HUM/bin/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar
-glm BOTH
-l INFO
-R /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/ucsc.hg19.fasta
-T UnifiedGenotyper
-I /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/recal_03/test.realn_8.recal.bam
-D /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/dbsnp_137.hg19.vcf
-o /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/callsnp/realn_8.recal.bam.vcf
-metrics /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/callsnp/snpcall.metrics
```

What can I do to solve this problem?

Thanks a lot !

Created 2013-04-30 12:45:05 | Updated 2013-04-30 12:45:25 | Tags: variantrecalibrator combinevariants variantannotator annotations

I have a set of VCFs with identical positions in them:

VCF1: 1 10097 . T . 26 . AN=196;DP=1622;MQ=20.06;MQ0=456 GT:DP

VCF2: 1 10097 . T . 21.34 . AN=198;DP=2338;MQ=19.53;MQ0=633 GT:DP

VCF3: 1 10097 . T . 11.70 . AN=240;DP=3957;MQ=19.74;MQ0=1085 GT:DP

VCF4: 1 10097 . T . 15.56 . AN=134;DP=1348;MQ=18.22;MQ0=442 GT:DP

If I use all of them as input for VariantRecalibrator, which annotations will VariantRecalibrator use? Should I instead merge the VCFs with CombineVariants and run VariantAnnotator, before I run VariantRecalibrator?

I'm not sure if the forum is for asking technical questions only or you are allowed to ask for best practices as well. Feel free to delete my question, if it doesn't belong here. Thank you.