MutSig FAQ

Q: "What are the differences between MutSig 1.0, 1.5, 2.0, and CV?"

A: MutSig relies on several sources of evidence in the data in order to estimate the amount of positive selection a gene underwent during tumorigenesis.

The three main sources are:

     1) Abundance of mutations relative to the background mutation rate (BMR)

     2) Clustering of mutations in hotspots within the gene

     3) Conservation of the mutated positions (i.e. did the mutation happen at a position that is conserved across vertebrates?)

The first line of evidence, Abundance, goes into the core significance calculation performed in all versions of MutSig. In MutSig1.0, the resulting p-value is simply called "p". MutSig1.0 assumes a constant BMR across all genes in the genome and all patients in the patient cohort. In MutSig1.5, the p-value is also called "p", but MutSig1.5 uses information from synonymous mutations to roughly estimate gene-specific BMRs. Later versions of MutSig (MutSigS2N and MutSigCV) use increasingly sophisticated procedures to treat the heterogeneity in per-gene, per-patient, and per-context BMRs, but they all answer essentially the same question: is the Abundance of mutations above the background level?
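
To make the Abundance question concrete, here is a minimal sketch of a one-sided binomial test against a single BMR. It is an illustration under simplifying assumptions, not MutSig's actual calculation: it ignores mutation categories, per-patient coverage, and sequence context, and the names abundance_p, n_covered, and bmr are hypothetical. The bmr argument could stand for a cohort-wide constant (as MutSig1.0 assumes) or for a gene-specific estimate (the idea introduced in MutSig1.5).

```python
from math import exp, lgamma, log, log1p


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log1p(-p))


def abundance_p(n_mut, n_covered, bmr):
    """P(X >= n_mut) for X ~ Binomial(n_covered, bmr), with 0 < bmr < 1.

    n_covered is the number of sequenced bases summed over patients
    (hypothetical simplification: one BMR for every base).
    """
    total = 0.0
    mean = n_covered * bmr
    for k in range(n_mut, n_covered + 1):
        term = exp(log_binom_pmf(k, n_covered, bmr))
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if k > mean and term < total * 1e-17:
            break
    return min(1.0, total)


# Hypothetical gene: 1,500 covered bases in each of 100 patients, BMR of 1e-6.
p_few = abundance_p(1, 150_000, 1e-6)
p_many = abundance_p(5, 150_000, 1e-6)
print(p_few, p_many)
```

Under these made-up numbers, a single mutation is unremarkable, while five mutations give a much smaller p-value.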

The other lines of evidence, Conservation and Clustering, are examined by a separate part of MutSig (historically called "MutSig2.0") that carries out many permutations and compares the observed distribution of mutations to the null distribution obtained from these permutations. The output of this permutation procedure is a set of additional p-values: p_clust is the significance of the amount of clustering in hotspots within the gene; p_cons is the significance of the enrichment of mutations in evolutionarily conserved positions of the gene; and p_joint is the joint significance of these two signals (Conservation and Clustering), calculated according to their joint distribution. The reason for calculating p_joint is to avoid double-counting significance when the two signals overlap, for example when mutations cluster in a conserved hotspot.
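
The general shape of a permutation test for clustering can be sketched as follows. This is an illustration only: the clustering statistic (fraction of mutations at the single most-mutated position) and the uniform placement of permuted mutations along the gene are assumptions made for the example, and the names clustering_stat and permutation_p are hypothetical. This FAQ does not specify the statistic or permutation scheme that MutSig itself uses.

```python
import random
from collections import Counter


def clustering_stat(positions):
    """Fraction of mutations at the single most-mutated position.

    Hypothetical statistic chosen for illustration; positions must be non-empty.
    """
    counts = Counter(positions)
    return max(counts.values()) / len(positions)


def permutation_p(positions, gene_length, n_perm=10_000, seed=0):
    """Empirical p-value for clustering under a uniform null (an assumption)."""
    rng = random.Random(seed)
    observed = clustering_stat(positions)
    n = len(positions)
    hits = 0
    for _ in range(n_perm):
        # Scatter the same number of mutations uniformly along the gene.
        permuted = [rng.randrange(gene_length) for _ in range(n)]
        if clustering_stat(permuted) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical gene of 1,000 positions: 8 of 10 mutations hit position 5.
p_hotspot = permutation_p([5] * 8 + [100, 200], 1000, n_perm=2000)
# Ten mutations at ten different positions: no clustering at all.
p_spread = permutation_p(list(range(0, 1000, 100)), 1000, n_perm=2000)
print(p_hotspot, p_spread)
```

The same skeleton would apply to a Conservation score by swapping in a statistic based on per-position conservation values.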

Combining all three lines of evidence: To make a full accounting of the signals of positive selection in a given gene, the three lines of evidence are combined using Fisher's method of combining p-values. Two p-values enter the combination: the "p" (or "p_classic") from the analysis of mutation Abundance (performed by MutSig 1.0/1.5/S2N/CV), and the p_joint from the analysis of Conservation and Clustering (performed by MutSig2.0), which already covers the second and third lines of evidence.
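
For two p-values, Fisher's method has a simple closed form, sketched below. The test statistic X = -2 * (ln p1 + ln p2) follows a chi-square distribution with 4 degrees of freedom under the null, and that distribution's upper tail is exp(-X/2) * (1 + X/2). The function name fisher_combine_two is hypothetical, and the sketch assumes both inputs lie in (0, 1]; it shows the textbook method, not MutSig's own code.

```python
from math import exp, log


def fisher_combine_two(p_classic, p_joint):
    """Combine two p-values with Fisher's method (both must be in (0, 1])."""
    x = -2.0 * (log(p_classic) + log(p_joint))
    # Upper tail of a chi-square distribution with 4 degrees of freedom.
    return exp(-x / 2.0) * (1.0 + x / 2.0)


combined = fisher_combine_two(0.05, 0.05)
print(combined)
```

With two inputs of 0.05 the combined p-value is about 0.0175: smaller than either input, but larger than their plain product of 0.0025.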