(Please send questions about the CMS code to ilya@broadinstitute.org)

Running the CMS pipeline

1. Set up your environment

To run CMS, you need to make sure that several auxiliary tools are accessible.

If you have an account on Broad Institute machines:

On Broad Institute machines, you can type

wget http://asgard/~ilya/broadenv.sh
source broadenv.sh

to set up the environment. To set this up automatically every time you log in, you may want to include the contents of broadenv.sh in your ~/.my.cshrc file.

For running any heavy computation, make sure you are NOT on one of the Broad's login servers (tin, cu, or ni) -- either type ish to log in to an interactive LSF host or use isub to submit to the interactive queue.

If you do not have an account on Broad Institute machines:

On other machines, you will need to set up:

Subversion 1.6 or later (if you have a Broad Institute account and want to use subversion to get incremental updates)
Python 2.6 later (but before 3.0). Add the .. directory to your PYTHONPATH variable. Make sure your Python installation includes numpy
Java 1.6 or later
Graphviz 2.16 or later

2. Checking out and updating the repository

If you don't have access to the Broad SVN reposity, download the file cmsRelease.zip.

If you do have SVN access, it's better to do a checkout as described below, so you can get incremental updates.

To do a checkout:

In an empty directory, type

svn co --depth files https://svn.broadinstitute.org/sabetilab/trunk
source trunk/updateCMS.sh

Afterwards, to update to the latest version of the code, change to the same directory (the one containing trunk/) and type

svn update trunk/updateTrunk.sh
source trunk/updateTrunk.sh

All relative paths in this document are given relative to trunk/Temp.

3. Setting up your data

Described below are the data files you will need to run CMS analyses. For each piece of data we first describe the semantic content - what information the data contains, any caveats - and then describe the details of the file formats.

All relative paths are relative to the trunk/Temp directory.

3.1 Phased genotypes

Semantic content: You'll have data for one or more populations. For each population, you'll have a sample of haplotypes from that population. Each haplotype gives allelic state for a list of SNPs. Missing data is not yet supported: for a given population, there is a fixed set of SNPs typed in that population; each haplotype in that population must give an allelic state for each of these SNPs. Not all SNPs must be typed in all populations in the input data; however, only SNPs typed in all populations are currently used for analysis. Unphased genotype data is also not yet supported; fully phased haplotypes are required.

Format details: The information about the typed SNPs, and the allelic state of each haplotype at each typed SNP, is represented in the format used by HapMap II, specifically the all/ version of that data. You have a set of populations (each designated by a text string, e.g. CEU) and a set of chromosomes. For each population and each chromosome, you need two files: a legend file and a phased file. (The *sample.txt files that are also part of HapMap are not currently needed.) The legend file lists the SNPs typed in this population on this chromosome; the phased file gives for each haplotype its allelic state at each SNP. Currently, each SNP must be assigned a unique rs number, of the form rsNNNN, which must be listed in the legend file. The rs numbers may be fictitious as long as each SNP has a unique one and they are consistent across different populations. The file names must follow the pattern genotypes_chrNN_POP_r21_nr_fwd_legend_all and genotypes_chrNN_POP_r21_nr_fwd_phased_all, with NN replaced by chromosome number and POP replaced by the population ID.

3.2 Ancestral allele information

Semantic content: For each SNP, we need to know the ancestral allele at that SNP. Only SNPs for which this is known can be used for analysis.

Format details: Create a directory with one file per chromosome, named according to the pattern genotypes_chrN_r21_nr_fwd_chimp_consensus_legend+freq with N replaced by the chromosome number. The file has no header line, and has the columns: rs, position, ancestral, derived; plus one column for each population, giving the number of haplotypes with the derived allele in that population (the numbers are currently unused so can be set to any integer). Examples can be found in the checkout in ../Data/Ilya_Data/hapmap2/chimp/.

3.3 Genetic map

Semantic content: For a list of genomic positions, the probability of recombination between that position and the start of the chromosome, per generation.

Format details: Make a directory where for each chromosome there is one file named genetic_map_chrN_bMM.txt with N replaced by chromosome number and MM replaced by genome build number (35 for hg17, 36 for hg18, 37 for hg19). The two columns each file must have are 'position' and 'GeneticMap(cM)'. Examples can be found in the checkout in ../Data/Ilya_Data/hapmap2/recombRates.

4. Running CMS

Change to trunk/Temp. The convention is that you work from within the Temp/ directory, put your code under ../Operations, put your data under ../Data, and use relative paths whenever possible. Then if you check things in and then check them out somewhere else, all paths will still work correctly since they're all relative to Temp/. All relative paths in this documentation are given relative to Temp/.

The top-level script for running CMS analyses of data in HapMap 2 format is Ilya_Temp/pyHapMapCMS.py. Copy this script so you can modify it to point it to your own data.

You will need to edit the call to DefineRulesTo_hapmapCMS() to point to your data. The parameters to DefineRulesTo_hapmapCMS() are documented in Operations.Ilya_Operations.localize.hapmapCMSDefineRulesTo_hapmapCMS().  In particular, make sure to indicate the correct genome build parameter (e.g. genomeBuild = 'hg19').

You can also edit the runtime parameters that control how the pipeline is run, by editing the pr.run() call at the end of the script. See the documentation of Operations.Ilya_Operations.PipeRun.python.PipeRun.run() method for a description of the arguments. By default, if you run the script without command-line arguments, it will write out the pipeline it plans to run (but not actually run it) to the directory ./pipeline_pyHapMapCMS. On Broad Institute machines, you can make this web-readable by running Ilya_Temp/putonweb.sh pipeline_pyHapMapCMS, then access http://broadinstitute.org/~yourusername/pipeline_pyHapMapCMS in your web browser. If you run the script with any argument, this will run the pipeline. If the argument contains the string loc it will run locally, otherwise will farm out to LSF.

After running the pipeline, you can find the output for each region under the directory you gave as the outDir parameter to DefineRulesTo_hapmapCMS(), under the regions/ subdirectory of that directory. Under regions/, for each region, look in the regions/regionName directory.