How do I run MutSig?
To run MutSigCV locally, download it from this site and follow the directions below.
MutSigCV will also be available soon from GenePattern. At that point, you will be able to run it on your own data using the Broad GenePattern server.
The following three data files are required:
This is a list of all the mutations to be analyzed, with all patients concatenated together. It should be a tab-delimited file with a header row. The positions of the columns are not important, but the header names are. The program will look up the information it needs based on the names of the headers.
The file can be a MAF file, and some of the columns come directly from the standard columns of the MAF format. However, there are a few additional columns required.
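Because the program finds its columns by header name rather than by position, it can be worth checking a table's header row before starting a run. The following is a minimal sketch of such a check, not part of MutSigCV itself. The function name `missing_columns` and the example table are hypothetical, and the list of required names is passed in by the caller rather than assumed here.

```python
import csv
import io

def missing_columns(tsv_text, required):
    """Return the required header names that are absent from a tab-delimited table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])          # first row is the header row
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Hypothetical two-row table; the column order is deliberately shuffled
# to show that position does not matter, only the header names do.
example = "categ\tHugo_Symbol\teffect\n3\tTP53\tnonsilent\n"

print(missing_columns(example, ["effect", "categ"]))                            # []
print(missing_columns(example, ["effect", "categ", "Variant_Classification"]))  # ['Variant_Classification']
```

For a real file, read its text with `open(path).read()` (or pass only the first line) and supply the column names your version of the program expects.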
The four columns required by the program are:
The standard set of categories we have used for many sequencing projects is as follows:
To compute the "categ" column, information from the reference genome is required: specifically, the identity of the nucleotides directly on each side of the mutation site. With that information, together with the Variant_Classification, Reference_Allele, and Tumor_Seq_Allele1+2 columns, "categ" can be computed.
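As an illustration of the reference lookup described above, the sketch below pulls the two flanking bases for a substitution from a reference sequence held in memory. It is not the MutSigCV implementation and it does not assign category numbers. The function name `describe_context`, the 0-based position convention, and the derived CpG and transition/transversion labels are all assumptions made for the example.

```python
# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def describe_context(ref_seq, pos, alt):
    """Describe a single-base substitution at 0-based index `pos` of `ref_seq`."""
    ref = ref_seq[pos].upper()
    alt = alt.upper()
    if alt == ref:
        raise ValueError("alternate base equals the reference base")
    # Flanking bases; "N" marks a missing neighbor at the sequence edge.
    left = ref_seq[pos - 1].upper() if pos > 0 else "N"
    right = ref_seq[pos + 1].upper() if pos + 1 < len(ref_seq) else "N"
    # A C followed by G, or a G preceded by C, sits in a CpG dinucleotide.
    in_cpg = (ref == "C" and right == "G") or (ref == "G" and left == "C")
    kind = "transition" if (ref, alt) in TRANSITIONS else "transversion"
    return {"left": left, "ref": ref, "right": right, "in_cpg": in_cpg, "kind": kind}

info = describe_context("ACGT", 1, "T")
print(info["left"], info["ref"], info["right"], info["in_cpg"], info["kind"])
# A C G True transition
```

In practice the reference sequence would come from the genome build matching the MAF file, and MAF positions are 1-based, so a real lookup would subtract one before indexing.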
Starting with MutSigCV version 1.3, an integrated preprocessing module assists with the calculation of these categ numbers, and also enables the automated determination of the optimal set of categories to be used for a given dataset.
The coverage table reports how many nucleotides were sequenced to adequate depth for mutation calling. Coverage is tabulated for each patient and for each gene, and is further broken down by "categ" and "effect" (as listed above). Again, "effect" can be "noncoding" (the flanking territory outside exons), "nonsilent" (bases which, when mutated, change the protein sequence, including splice sites), or "silent" (bases which give a synonymous change when mutated). Note that some coding positions contribute fractionally to the "silent" and "nonsilent" zones, in a ratio of 1/3 to 2/3 (or vice versa), depending on the consequences of mutating to each of the three possible alternate bases.
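The fractional contribution can be worked out directly from the standard genetic code. The sketch below is an illustration only, not MutSigCV code: for one position of one codon it tries the three alternate bases and reports the fraction that leave the amino acid unchanged. The function name `silent_fraction` is hypothetical, and splice sites and strand orientation are ignored.

```python
from fractions import Fraction

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def silent_fraction(codon, index):
    """Fraction of the three alternate bases at `index` (0-2) that are synonymous."""
    codon = codon.upper()
    original = CODON_TABLE[codon]
    silent = 0
    for base in BASES:
        if base == codon[index]:
            continue  # skip the reference base itself
        mutated = codon[:index] + base + codon[index + 1:]
        if CODON_TABLE[mutated] == original:
            silent += 1
    return Fraction(silent, 3)

print(silent_fraction("TTT", 2))  # 1/3  (Phe: only TTC is synonymous)
print(silent_fraction("ATT", 2))  # 2/3  (Ile: ATC and ATA are synonymous)
```

The nonsilent share of the same position is simply one minus this value, which gives the 1/3 to 2/3 split (or the reverse) mentioned above.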
The columns required in the coverage table are:
We recognize that detailed coverage information is not always available. In such cases, a reasonable approach is to carry out the computation assuming full coverage. We have prepared a file that can be used for this purpose. It is a "full coverage" file, or more accurately a "territory" file: the only information it contributes is a tabulation of how the reference sequence of the human exome breaks down by gene, categ, and effect. To download this file, see the section below about MutSigCV v1.3.
This table lists genomic parameters for each gene being analyzed. They are called covariates because they co-vary with mutation rate. They are used to calculate distances between pairs of genes in a "covariate space". These distances determine the nearest neighbors of each gene, so that information about the local background mutation rate (BMR) can be pooled among them.
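To make the idea of neighbors in covariate space concrete, here is a minimal sketch that standardizes each covariate and ranks the other genes by plain Euclidean distance. This page does not specify the distance metric or the number of neighbors MutSigCV actually uses, so both are assumptions here, as are the function name `nearest_neighbors` and the gene labels and values in the example.

```python
import math

def nearest_neighbors(covariates, gene, k):
    """Return the k genes closest to `gene`; `covariates` maps gene -> list of floats."""
    names = list(covariates)
    n_dims = len(covariates[gene])
    means, sds = [], []
    for d in range(n_dims):
        values = [covariates[g][d] for g in names]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        sds.append(math.sqrt(var) or 1.0)  # guard against a constant covariate

    def scaled(g):
        # z-score so that no single covariate dominates the distance
        return [(covariates[g][d] - means[d]) / sds[d] for d in range(n_dims)]

    target = scaled(gene)
    ranked = sorted((math.dist(target, scaled(g)), g) for g in names if g != gene)
    return [g for _, g in ranked[:k]]

# Hypothetical genes with two made-up covariate values each.
table = {"A": [1.0, 1.0], "B": [1.1, 0.9], "C": [5.0, 5.0], "D": [1.2, 1.2]}
print(nearest_neighbors(table, "A", 2))  # ['B', 'D']
```

Standardizing first is a design choice for the sketch: covariates measured on very different scales would otherwise contribute unequally to the distance.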
The columns of this file are:
The covariates table provided in the Example Data has proven useful for analyzing many cancer types. The table contains one value per gene for: (1) global expression, derived from RNA-Seq data and summed across the 91 cell lines in the CCLE (Barretina et al.); (2) DNA replication time (from Chen et al.); and (3) the HiC statistic, a measure of open vs. closed chromatin state (from Lieberman-Aiden et al.).
Running the algorithm
If you have a license for Matlab, you can run MutSigCV from its source code file: MutSigCV.m
Open Matlab and type the following command at the ">>" Matlab prompt.
If you do not have a license for Matlab, you can run the compiled version of MutSigCV using the free Matlab MCR:
run_MutSigCV.sh <path_to_MCR> mutations.maf coverage.txt covariates.txt output.txt
The algorithm will load the three input files, process them using the MutSigCV algorithm, and then finally write the output table to the file 'output.txt'.
Be sure to replace mutations.maf, etc., with the actual paths to the input files.
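If the compiled version is launched from a script, the argument order shown above can be assembled programmatically and the input paths checked first. The sketch below is one hypothetical way to do that. The helper names `build_command` and `missing_inputs` are not part of MutSigCV, and the paths in the example are placeholders.

```python
import os

def build_command(mcr_path, mutations, coverage, covariates, output,
                  script="run_MutSigCV.sh"):
    """Assemble the argument list in the order used by the compiled version."""
    return [script, mcr_path, mutations, coverage, covariates, output]

def missing_inputs(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Placeholder paths; replace them with the real locations on your system.
cmd = build_command("/path/to/MCR", "mutations.maf", "coverage.txt",
                    "covariates.txt", "output.txt")
print(" ".join(cmd))
# run_MutSigCV.sh /path/to/MCR mutations.maf coverage.txt covariates.txt output.txt
```

Once `missing_inputs(cmd[2:5])` returns an empty list, the command could be passed to `subprocess.run(cmd, check=True)`.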
Running MutSigCV when all you have is a MAF file
Starting with v1.3 of the code, MutSigCV has a preprocessing module that takes care of organizing the "categ" and "effect" information. This makes it easy to run MutSigCV when all you have is a MAF file.
Then run the program with the following six arguments (instead of four):
From the Matlab prompt, that is:
or using the compiled version of MutSigCV and the free Matlab MCR:
run_MutSigCV.sh <path_to_MCR> my_mutations.maf exome_full192.coverage.txt gene.covariates.txt my_results mutation_type_dictionary_file.txt chr_files_hg19
Please note that the preprocessing module currently supports data from the human exome, in builds hg18 or hg19. Future work will enable use of other builds (e.g. mm9, canFam2).
The data used for the TCGA Lung Squamous paper is available here:
MutSigCV_example_data.1.0.1.zip (same data as v1.0, but renames files and includes the expected output file)
Unzip the archive to yield the following four files:
The input data can be processed by invoking the following command from the Matlab ">>" prompt:
NOTE ABOUT VERSIONS
The list of significantly mutated genes found by MutSigCV1.0 is the same as that published in the LUSC manuscript, with the exception of one additional gene, FBXW7, which was not initially reported as significantly mutated, but now is. Also please note that the per-gene p-values differ due to the slightly different variants of the algorithm used. In particular, the dynamic range of MutSigCV1.0 now ends at ~10^-16.