Component tests

CMS combines the scores from five tests for selection: three haplotype-based tests (iHS, XP-EHH and ΔiHH), one test of population differentiation (Fst), and one test of derived allele frequency (ΔDAF).

iHS and XP-EHH are calculated as described in refs. [1] and [2], respectively. ΔiHH is a modification of the iHS test that is less sensitive to the length of the ancestral haplotype [3]. It is calculated by finding the integrated haplotype homozygosity (iHH) scores of both the ancestral and derived variants (just as in iHS), but taking their difference instead of their log-ratio.
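As a minimal sketch of the contrast between the two statistics, the snippet below computes the unstandardized log-ratio used by iHS and the difference used by ΔiHH from the same pair of iHH values. The function names are hypothetical, the ancestral-minus-derived order is an assumption (the text above does not fix a sign convention), and the frequency-binned standardization applied to iHS in [1] is omitted.

```python
import math


def unstandardized_ihs(ihh_ancestral: float, ihh_derived: float) -> float:
    """Log-ratio of the two iHH values (before any standardization)."""
    return math.log(ihh_ancestral / ihh_derived)


def delta_ihh(ihh_ancestral: float, ihh_derived: float) -> float:
    """Difference of the two iHH values.

    The ancestral-minus-derived order is an assumption made for this sketch.
    """
    return ihh_ancestral - ihh_derived


# Two SNPs with the same iHH ratio but haplotypes on different length scales.
short_pair = (0.04, 0.02)
long_pair = (0.4, 0.2)

ratio_short = unstandardized_ihs(*short_pair)
ratio_long = unstandardized_ihs(*long_pair)
diff_short = delta_ihh(*short_pair)
diff_long = delta_ihh(*long_pair)
```

The log-ratio is the same for both pairs, while the difference scales with the absolute iHH values, which is why the two statistics respond differently to haplotype length.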

Fst is calculated using the estimator of Weir and Cockerham [4]. ΔDAF is the difference between the derived allele frequency in a given population and the mean derived allele frequency across the other populations included in the analysis [3]. It ranges from -1 to 1, with positive scores indicating derived alleles that are at higher frequency in the selected population than in the other populations.
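The ΔDAF definition above can be sketched as follows. The function name and the example frequencies are hypothetical, and the mean is taken as an unweighted mean over populations, which is an assumption.

```python
def delta_daf(daf_target: float, daf_others: list[float]) -> float:
    """Derived allele frequency in the target population minus the
    unweighted mean derived allele frequency of the other populations."""
    if not daf_others:
        raise ValueError("at least one comparison population is required")
    return daf_target - sum(daf_others) / len(daf_others)


# Derived allele common in the target population, rare elsewhere.
example = delta_daf(0.9, [0.1, 0.3])
```

With these illustrative frequencies the score is positive, matching the interpretation given above.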

CMS

CMS is used to localize the signal of selection within a given region of the genome and identify the variants that are most likely to be causal. To use the CMS framework in genome-wide scans for regions under selection, see the section on genome-wide extension below.

To calculate the CMS score of a SNP, we first consider each test separately and calculate the posterior probability that the SNP is under selection given its score on that test. The model assumes that exactly one SNP in the region is under selection and that each SNP is equally likely, a priori, to be the selected SNP. The likelihoods of a score under the selected and unselected scenarios are taken from empirical distributions generated using simulated data from cosi [5]. The unselected scenario describes SNPs that lie within 500 kb of the selected SNP but are not themselves under selection.
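One way to turn simulated scores into an empirical likelihood is a simple histogram lookup, sketched below. The binning scheme, the pseudocount, the clamping of out-of-range scores into the end bins, and all names are assumptions made for illustration; the text above only states that the distributions are empirical and come from simulation.

```python
from bisect import bisect_right


def bin_index(value: float, bin_edges: list[float]) -> int:
    """Index of the bin containing value; out-of-range values are clamped
    into the first or last bin (an assumption of this sketch)."""
    n_bins = len(bin_edges) - 1
    idx = bisect_right(bin_edges, value) - 1
    return min(max(idx, 0), n_bins - 1)


def empirical_likelihood(score: float,
                         simulated_scores: list[float],
                         bin_edges: list[float],
                         pseudocount: float = 1.0) -> float:
    """Fraction of simulated scores falling in the same bin as score.

    The pseudocount keeps empty bins from yielding a likelihood of zero.
    """
    counts = [pseudocount] * (len(bin_edges) - 1)
    for s in simulated_scores:
        counts[bin_index(s, bin_edges)] += 1
    return counts[bin_index(score, bin_edges)] / sum(counts)


edges = [0.0, 1.0, 2.0, 3.0]
simulated = [0.5, 0.6, 1.5, 2.5]  # illustrative values, not real simulation output
lik = empirical_likelihood(0.7, simulated, edges, pseudocount=0.0)
```

The same function would be called once with scores simulated under the selected scenario and once with scores simulated under the unselected scenario to obtain the two likelihoods for a test.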

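The posterior probability for test i is calculated using Bayes' theorem:

$$P(\mathrm{selected} \mid s_i) = \frac{P(s_i \mid \mathrm{selected})\,\pi}{P(s_i \mid \mathrm{selected})\,\pi + P(s_i \mid \mathrm{unselected})\,(1 - \pi)}, \qquad \pi = \frac{1}{N_{\mathrm{SNP}}}$$

where s_i is the SNP's score for test i, π is the prior probability that the SNP is the selected SNP, and N_SNP is the number of SNPs in the region.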
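The per-test posterior can be sketched as below, assuming a uniform prior of 1/N_SNP, which follows from the stated assumption that exactly one SNP in the region is selected and all SNPs are equally likely candidates. The function and parameter names are hypothetical.

```python
def posterior_selected(lik_selected: float,
                       lik_unselected: float,
                       n_snps: int) -> float:
    """Posterior probability that a SNP is the selected one, given the
    likelihood of its score under the selected and unselected scenarios."""
    if n_snps < 1:
        raise ValueError("n_snps must be at least 1")
    prior = 1.0 / n_snps  # uniform prior over the SNPs in the region
    numerator = lik_selected * prior
    denominator = numerator + lik_unselected * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("score has zero likelihood under both scenarios")
    return numerator / denominator


# An uninformative score (equal likelihoods) leaves the prior unchanged.
uninformative = posterior_selected(0.5, 0.5, 10)
# A score ten times more likely under selection raises the posterior.
informative = posterior_selected(0.5, 0.05, 10)
```

When the two likelihoods are equal the posterior reduces to the prior, which is a quick sanity check on any implementation.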

CMS is simply the product of the posterior probabilities for all five tests.
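A short sketch of this step: multiply the five per-test posteriors for each SNP and rank the SNPs in the region by the result. The SNP identifiers and posterior values are made up for illustration, and the helper names are hypothetical.

```python
import math


def cms_score(posteriors: list[float]) -> float:
    """Product of the per-test posterior probabilities for one SNP."""
    return math.prod(posteriors)


def rank_snps(region: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (snp_id, CMS) pairs sorted from highest to lowest CMS."""
    scores = {snp: cms_score(p) for snp, p in region.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Five posteriors per SNP, in a fixed test order (illustrative values).
region = {
    "snpA": [0.5, 0.5, 0.5, 0.5, 0.5],
    "snpB": [0.1, 0.1, 0.1, 0.1, 0.1],
}
ranking = rank_snps(region)
```

The top-ranked SNP is the one the composite score considers most likely to be causal within the region.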

Genome-wide extension

To use CMS in genome-wide scans for regions under selection, we change the unselected scenario to represent SNPs in entirely neutral regions. Furthermore, since we do not have prior distributions for the number of SNPs under selection across the genome, we calculate Bayes factors instead of posterior probabilities for each of the SNPs. The Bayes factor for a given test is calculated by taking the ratio of the likelihoods under a selected scenario and a neutral scenario.
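Written out, the ratio described above takes the following form. The symbols are notation chosen here for illustration: s_i is the SNP's score for test i, and "selected" and "neutral" denote the two simulated scenarios.

```latex
\mathrm{BF}_i = \frac{P(s_i \mid \mathrm{selected})}{P(s_i \mid \mathrm{neutral})}
```

Unlike the posterior probability, this quantity involves no prior on the number of selected SNPs, so it does not depend on how many SNPs a region contains.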

The genome-wide CMS score of a given SNP is the product of the Bayes factors for the five tests.
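The genome-wide score can be sketched as a product of likelihood ratios, as below. The function names are hypothetical, and rejecting a zero neutral likelihood is a choice made for this sketch; an implementation might instead smooth the empirical distributions.

```python
import math


def bayes_factor(lik_selected: float, lik_neutral: float) -> float:
    """Ratio of the likelihood under selection to the likelihood under neutrality."""
    if lik_neutral <= 0.0:
        raise ValueError("neutral likelihood must be positive")
    return lik_selected / lik_neutral


def genome_wide_cms(likelihood_pairs: list[tuple[float, float]]) -> float:
    """Product of the per-test Bayes factors for one SNP.

    likelihood_pairs holds one (selected, neutral) likelihood pair per test.
    """
    return math.prod(bayes_factor(sel, neu) for sel, neu in likelihood_pairs)


# Three tests shown for brevity; the method uses five (illustrative values).
score = genome_wide_cms([(0.4, 0.2), (0.3, 0.3), (0.1, 0.2)])
```

A test whose score is equally likely under both scenarios contributes a factor of 1 and leaves the composite unchanged.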

To identify candidate genomic regions, we call 100 kb windows in which 30% of SNPs have a score above 3. This threshold corresponds to a false positive rate of 0.1% in simulations.
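The window-calling rule can be sketched as follows. Several details are assumptions, since the text does not specify them: windows are non-overlapping and aligned to coordinate 0, "30% of SNPs" is read as "at least 30%", and the input covers a single chromosome. The function name and example data are hypothetical.

```python
from collections import defaultdict


def call_windows(snps: list[tuple[int, float]],
                 window_size: int = 100_000,
                 min_fraction: float = 0.30,
                 threshold: float = 3.0) -> list[tuple[int, int]]:
    """Return (start, end) coordinates of windows in which at least
    min_fraction of SNPs have a score above threshold.

    snps is a list of (position_bp, score) pairs from one chromosome.
    """
    total = defaultdict(int)
    high = defaultdict(int)
    for position, score in snps:
        window = position // window_size
        total[window] += 1
        if score > threshold:
            high[window] += 1
    return [
        (window * window_size, (window + 1) * window_size)
        for window in sorted(total)
        if high[window] / total[window] >= min_fraction
    ]


# One of three SNPs in the first window exceeds the threshold (33%),
# and none in the second window does.
snps = [(10, 4.0), (20, 1.0), (30, 1.0), (150_000, 1.0)]
called = call_windows(snps)
```

Sliding or overlapping windows would need a different grouping step, but the fraction test applied to each window would stay the same.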

References

  1. Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006 Mar;4(3):e72.
  2. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R, Schaffner SF, Lander ES; International HapMap Consortium. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007 Oct 18;449(7164):913-8.
  3. Grossman SR, Shylakhter I, Karlsson EK, Byrne EH, Morales S, Frieden G, Hostetter E, Angelino E, Garber M, Zuk O, Lander ES, Schaffner SF, Sabeti PC. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science. 2010 Feb 12;327(5967):883-6.
  4. Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984 Nov;38(6):1358-70.
  5. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005 Nov;15(11):1576-83.