What is MutSig?
MutSig stands for "Mutation Significance". MutSig analyzes lists of mutations discovered in DNA sequencing, to identify genes that were mutated more often than expected by chance given background mutation processes.
For more information, see Lawrence, M. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214-218 (2013).
MutSig takes as input lists of mutations (including indels) from a set of samples (patients) that were subjected to DNA sequencing, together with information about how much territory was covered in the sequencing. MutSig was originally developed for analyzing somatic mutations, but it has also been useful in analyzing germline mutations. MutSig builds a model of the background mutation processes that were at work during formation of the tumors, and it analyzes the mutations of each gene to identify genes that were mutated more often than expected by chance, given the background model.
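To make "more often than expected by chance" concrete, here is a minimal sketch of a per-gene test. It is not MutSig's actual scoring: it assumes a single known background rate per base and a simple binomial model, and the names `gene_p_value`, `covered_bases`, and `bmr` are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def gene_p_value(n_mutations, covered_bases, bmr):
    """P(X >= n_mutations) for X ~ Binomial(covered_bases, bmr).

    covered_bases: bases of the gene covered, summed over all patients.
    bmr: assumed background mutation rate per base (0 < bmr < 1).
    """
    if n_mutations <= 0:
        return 1.0
    mean = covered_bases * bmr
    total = 0.0
    # Sum the upper tail directly; stop once terms past the mean are negligible.
    for k in range(n_mutations, covered_bases + 1):
        term = math.exp(log_binom_pmf(k, covered_bases, bmr))
        total += term
        if k > mean and term < 1e-17 * total:
            break
    return min(total, 1.0)

# Hypothetical numbers: 12 mutations in a gene with 1,500 covered bases
# in each of 200 patients, and an assumed BMR of 2 mutations per megabase.
p = gene_p_value(12, 1500 * 200, 2e-6)
print(f"p = {p:.3g}")
```

The coverage information matters here: the same mutation count is far less surprising in a gene with ten times as many sequenced bases.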
The cartoon below illustrates the overall concept. On the left is a set of genomes (or exomes), each from sequencing the tumor cells of a different cancer patient. Genes are indicated as colored bands, and somatic mutations are indicated by red triangles. First, tumors are aggregated together and mutations are tallied, and then a score and p-value are calculated for each gene. A significance threshold is chosen to control the False Discovery Rate (FDR), and genes exceeding this threshold are reported as significantly mutated.
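The final thresholding step can be sketched as follows. This assumes the Benjamini-Hochberg procedure is used to control the FDR, which the text above does not specify; the function names and the 0.1 cutoff are hypothetical choices for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

def significant_genes(genes, p_values, fdr=0.1):
    """Genes whose q-value is at or below the chosen FDR threshold."""
    q_values = benjamini_hochberg(p_values)
    return [g for g, q in zip(genes, q_values) if q <= fdr]

# Made-up gene labels and p-values, purely for illustration.
genes = ["geneA", "geneB", "geneC", "geneD"]
p_values = [0.01, 0.04, 0.03, 0.5]
print(significant_genes(genes, p_values))
```

Because thousands of genes are tested at once, a raw p-value cutoff would admit many chance findings; working with q-values bounds the expected fraction of false positives among the reported genes.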
A critical component of MutSigCV is the background model for mutations, the probability that a base is mutated by chance. Patients being analyzed do not all have the same background mutation rate, or the same spectrum of mutations. Similarly, not all regions of the genome (or exome) have the same background mutation patterns.
MutSig has been evolving since the early days of clinical sequencing, and several versions have been in use:
MutSig1.0 assumed a constant background mutation rate (BMR) across the genome.
MutSig1.5 implemented a rudimentary estimate of per-gene background mutation rates from analyzing the silent (synonymous) mutations of each gene and the rough expression level of the gene.
MutSig2.0, although named similarly to the above versions, is actually a rather different animal. While the versions listed above consider the *abundance* of mutations above background, this part of MutSig looks at two additional independent signals of positive selection in genes: the *clustering* of mutations in hotspots, and the functional impact of the mutations, which can be estimated in a number of ways (PolyPhen, SIFT, CHASM, Mutation Assessor, etc.), or even simply from the *conservation* of the sites, that is, how conserved they were during vertebrate evolution. These two signals are then combined with each other and with the results of MutSig1.5 to yield a final measure of significance that takes all three signals (Abundance, Clustering, and Conservation) into account.
MutSigS2N was a rudimentary precursor of MutSigCV, used in a few interim projects before MutSigCV was developed.
MutSigCV is our most current version of the algorithm. The "CV" stands for "covariates". MutSigCV starts from the observation that the data is very sparse, and that there are usually too few silent mutations in a gene for its BMR to be estimated with any confidence. MutSigCV improves the BMR estimation by pooling data from 'neighbor' genes in covariate space. These neighbor genes are chosen on the basis of having similar genomic properties to the central gene in question: properties such as DNA replication time, chromatin state (open/closed), and general level of transcription activity (e.g. highly transcribed vs. not transcribed at all). These genomic parameters have been observed to strongly correlate (co-vary) with background mutation rate. For instance, genes that replicate early in S-phase tend to have much lower mutation rates than late-replicating genes. Genes that are highly transcribed also tend to have lower mutation rates than unexpressed genes, due in part to the effects of transcription-coupled repair (TCR). Genes in closed chromatin (as measured by Hi-C or ChIP-seq) have higher mutation rates than genes in open chromatin. Incorporating these covariates into the background model substantially reduces the number of false-positive findings.
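The idea of pooling neighbor genes in covariate space can be illustrated with a small sketch. This is a deliberate simplification and not the MutSigCV implementation: it assumes covariates are already scaled to comparable units, uses plain Euclidean distance, and always pools a fixed number `k` of nearest neighbors. The function name, data layout, and numbers are all hypothetical.

```python
import math

def pooled_bmr(target, genes, k=3):
    """Estimate a gene's BMR from itself plus its k nearest neighbors.

    genes maps gene name -> (covariates, n_silent, silent_bases), where
    covariates is a tuple such as (replication_time, expression, chromatin),
    n_silent is the silent mutation count, and silent_bases is the covered
    territory in which silent mutations could occur.
    """
    target_cov = genes[target][0]
    # Rank all other genes by distance to the target in covariate space.
    neighbors = sorted(
        (name for name in genes if name != target),
        key=lambda name: math.dist(target_cov, genes[name][0]),
    )
    pool = [target] + neighbors[:k]
    n_silent = sum(genes[g][1] for g in pool)
    silent_bases = sum(genes[g][2] for g in pool)
    return n_silent / silent_bases, pool

# Made-up data: two genes with similar covariates and one very different gene.
genes = {
    "A": ((0.0, 0.0), 1, 1000),
    "B": ((0.1, 0.0), 3, 1000),
    "C": ((5.0, 5.0), 50, 1000),
}
rate, pool = pooled_bmr("A", genes, k=1)
print(rate, pool)
```

In this toy example gene A alone has a single silent mutation, too few to estimate a rate with any confidence. Borrowing counts from its closest neighbor B increases the amount of data behind the estimate, while the dissimilar gene C, which has a much higher mutation count, is left out of the pool.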