The SiPhy software package comprises several analysis modules and utility programs. All are Java programs executed from the command line. The primary executable file, bin/scaler_0_5.jar, includes the following modules (the -task parameter identifies modules by task number). The general methods are described below and their full description, inputs and outputs are described in the appropriate sections:
|1||Slide a fixed window to testimate local rate of vatiation [ω in Equation (10) of Garber, et al.].|
|10||Estimates local rate of variation for a set of fixed regions (such as exons).|
|7||Estimates substitution patterns bias [π in Equation (10) of Garber, et al.] for each base pair.|
|11||Estimates the substitution patterns for a set of fixed regions by calculating the sum of the scores produced by Task=7.|
|16||Estimates a conservation score for a given sequence motif using a simple generalization of the substitution pattern estimation.|
Java 1.5 or later: The SiPhy software requires Java 1.5 or later. It was written and tested using Java 1.5.
As a typical constraint estimation tool, SiPhy requires two main inputs: a multiple alignment file and neutral "average" evolutionary model file.
SiPhy accepts three multiple alignment formats:
Multiple alignment files for well-studied phylogenies can be found on public websites such as the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu).
Mulitple sequence alignments of genome sequence allow for missing (or inserted) sequence in one or several species to align as "gaps". Such gaps are abundant and may be the result evolutionary insertion and deletions events that, either by random drift or due to selective pressure, become fixed in a species, or they may be due to artifacts of the multiple alignment process. Further, genome assemblies vary in their completenes, and contain in many "gapped" regions of missing sequence. In this version, SiPhy treats both missing data and alignment gaps by ignoring the species with missing or gapped alignments from the computations.
The model file specifies the parameters of the probabilistic model describing the evolutionary process: A matrix describing the rate at which each nucleotide mutates into another, the equilibrium distribution of bases, and the phylogentic tree topology connecting the species under study.
SiPhy modules estimate the local divergence of different statistics from the neutral "average" evolutionary model provided by the model file. The modules work with the output of two programs commonly used to generate such models:
Two types of genomic sequence are understood to have evolved under no or very weak selection: (1) 4D-sites, the four-fold degenerate third codon position of protein-coding genes, and (2) Ancestral Repeats (AR), mobile elements that are ancestral to all sequences under study. Several software programs, such as Phast and PAML, estimate the average substitution rate from the extracted alignment crresponding to these sequences. The resulting model file includes the scaled phylogenetic tree with edge length proportional to the estimates average substitutions per site, a rate matrix Q representing the rate of substitutions between the four nucleotides, and a stationary distribution.
Following is an example model file. The identifiers in the TREE field of the model file must match identifiers in the multiple alignment file.
BACKGROUND: 0.2 0.3 0.3 0.2 RATE_MAT: -1.0860163084198684 0.25225991842162027 0.6937865599085344 0.13996983008971356 0.16817327894774686 -0.9225558555676914 0.2866374459445824 0.46774513067536205 0.4625243732723563 0.2866374459445824 -0.9192377889311856 0.17007596971424688 0.13996983008971356 0.701617696013043 0.2551139545713703 -1.0967014806741269 TREE: (((((hg18:0.006,panTro2:0.007):0.024,rheMac2:0.036):0.08,tarSyr1:0.152):0.009,(micMur1:0.099,otoGar1:0.143):0.037):0.004,((mm9:0.089,rn4:0.098):0.239,speTri1:0.159):0.028):0.02
SiPhy provides two ways to estimate the local rate of variation [ω in Equation (10) of Garber, et al.]: by sliding a fixed-size window or by providing a set of annotations in BED format. To estimate variation for the whole genome or a generic distribution of constraint in the alignment, use a fixed-size window. To estimate variation for specific genomic areas, use annotations. The result is a tab-delimited output file with the estimated rate of variation for each window or element. The paper discusses only the fixed-size window method.
Estimate the local rate of variation [ω in Equation (10) of Garber, et al.] by sliding a fixed-size window.
java -jar bin/scaler_0.5.jar -task 1
The output is a tab delimited file with the following columns:
/seq/mgarber/tools/runJava edu.mit.broad.prodinfo.conservation.TreeScaler -task 1 -ref hg17 \\ -in ENm002_compressed.fa -window 12 -mod combined.AR.mod -out ENm002_w12.omegas -windowOverlap 11 -minTreeLenght 0.5
This command generated the data describbed in Garber et al. for Encode region ENm002. Both the input and results of the command are available in the data directory.
Estimate the local rate of variation [ω in Equation (10) of Garber, et al.] providing a set of annotations in BED format. The paper discusses how to estimate the local rate of variation using fixed-size windows; it does not discuss this method.
java -jar bin/scaler_0.5.jar -task 10
The output is a tab delimited file with the following columns
java -jar bin/scaler_0.5.jar -task 10 -alignment chr22.maf -in sample_data/tug1.exons.bed -mod sample_data/autosomal.mod -format MAF \\ -ignore danRer5,oryLat2,gasAcu1,fr2,taeGut1,tetNig1,xenTro2,anoCar1,galGal3,ornAna1,monDom4 > sample_data/tug1.exons.exon.omegas
This command computes the change in substitutuion rate (ω) for each of the three exons of the TUG1 non-coding RNA. The model file and output can be found in the data directory. The multiple alignment for chromosome 22 can be downloaded from UCSC genome browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz44way/maf/chr22.maf.gz . The -ignore flag used here restricts the program to eutherian mammals rather than using all vertebrate species in the alignment.
SiPhy estimates substitution patterns [π in Equation (10) of Garber, et al.] per column, essentially using a fixed-size window of length 1. SiPhy then sums the resulting scores to estimate substitution patterns across larger fixed-size windows.
Estimate substitution patterns [π in Equation (10) of Garber, et al.] per column.
java -jar bin/scaler_0.5.jar -task 7
The output is a tab delimited file with the following columns:
java -jar bin/scaler_0.5.jar -task 7 -ref hg17 -in ENm002_compressed.fa \\ -mod combined.AR.mod -minTreeLength 0.5 -out ENm002.pi
This command will estimate the stationary distribution (π) and its log-odds ratio score for each column in the alignment. Both input files and the output of this command can be found in the data directory.
After estimating substitution patterns per column as described above, use this method to sum the resulting scores to estimate substitution patterns across a larger region or a fixed-sized window.
java -jar bin/scaler_0.5.jar -task 11
The output is a BED file contaiig all windows with score equal to the sum of the individual site log-odds ratio score.
As describbed in Garber, et al.] Siphy can define conserved elements whose bases are evolving in unexpected subsitituion patterns using a hidden Markov model. Siphy supports two commands, the first to find the Viterbi path, the second to estimate the posterior probability that a base is evolving under the biased substitution pattern model rather than the neutral model. Both commands have the same parameters and only differ in their task number.
java -jar bin/pihmm.jar -task 1 (or -task 2 to compute posterior probabilities )
java -jar bin/pihmm.jar -task 1 -gamma 0.08 -l 10 -mod sample_data/combined.AR.mod -in sample_data/ENm002_compressed.fa -ref hg17 -out ENm002.viterbiand
java -jar bin/pihmm.jar -task 2 -gamma 0.08 -l 10 -mod sample_data/combined.AR.mod -in sample_data/ENm002_compressed.fa -ref hg17 -out ENm002.viterbi
Computes the viterbipath and posterior probabilities using the π HMM on the encode region ENm002. The input and output used in this example can be downloaded from the data directory.
Estimates a conservation score for a given sequence motif using a simple generalization of the substitution pattern estimation. The algorithm is similar to the one described by Moses, et al. in MONKEY (Genome Biology 2004, 5:R98). This method was not discussed in Garber, et al. but was used in Guttman, et al. (Nature 2009, 458:223-227) to rank lincRNAs by their potential to be regulated by known transcription binding sites based on occurrence of conserved binding sites.
SiPhy estimates the conservation score by sliding a fixed-size window equal in length to the motif. You can analyze a single genomic interval by specifying the start and end coordinates of that interval (-start and -end parameters) or analyze a set of coordinates by specifying a set of annotations in a BED file (-in parameter). The input multiple alignment files must be in the MAF file format.
java -jar bin/scaler_0.5.jar -task 16
The ouput is one BED file per motif (the file name being the motif name as it appears in the pwm input file) with the locations that pass the minimum seed score, the orientation of the entry reflects the best orientation of the motif match. The score is sum of the log-odds ratio of the likelihood that each column of the element was generated by the evolutionary model with stationary distribution (π) equals to the corresponding column in the motif or by the null model.
java -jar bin/scaler_0.5.jar -task 16 -pwm p53.pwm -indir multiz44way/ -chr 22 -start 29695034 -end 29705980 -mod autosomal.mod \\ -ignore danRer5,oryLat2,gasAcu1,fr2,taeGut1,tetNig1,xenTro2,anoCar1,galGal3,ornAna1,monDom4 -minSeedScore 2 -outdir outdir
This command will scan the TUG1 non-coding RNA loci for conserved instances of the transfac P53 motif. The -minSeedScore = 2 improves performance by scanning the reference first, only running the maximum likelihood estimation for windows where the log-odds ratio of the likelihood of the reference (hg18) bases being generated by the motif vs the neutral model specified is grater than 2. All input files, except for the directory with the genome wide alignments can be found in the data directory, to run this command it is sufficient to create a multiz44way directory and download the chr22 alignment from the UCSC browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz44way/maf/chr22.maf.gz the -ignore flag restricts the program to eutherian mammals rather than using all vertebrate species in the alignment.
Generates an index file for a MAF file.
java -jar mafutils.jar -task 3 < alignment_file > alignment_file.index
The MAF index file must have the same name as the MAF file name and the .index suffix.
Converts a PAML output file to the file format used by Phast (and SiPhy). Requires PERL
Command: perl extract_model_from_paml_out.pl