29 Mammals Project - Supplementary Info

This page contains all supplementary data for Lindblad-Toh et al. “A high-resolution map of evolutionary constraint in the human genome based on 29 eutherian mammals.”

The following data can also be viewed at these locations:

For general questions, contact: Kerstin Lindblad-Toh

For questions about a particular dataset, please contact the individual identified below for the dataset.

Constrained Elements (SiPhy, -omega & pi)
Summary: These files contain the lists of constrained elements. For each 12-mer in the human genome a measure of constraint was scored using SiPhy (see reference below), both as a rate-based score (omega) and as a measure that includes biased substitution patterns (pi). 12-mers falling in annotated Ancestral Repeats were used as a background. An empirical cutoff score was set corresponding to 10% FDR, and all 12-mers above this score were considered significant. Overlapping significant 12-mers were clustered to yield larger elements.
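The final clustering step described above can be illustrated with a minimal sketch. It assumes each significant 12-mer on a single chromosome is given by its 0-based start position and that "overlapping" means sharing at least one base. The function name `cluster_kmers` and the input representation are hypothetical and are not taken from SiPhy.

```python
def cluster_kmers(starts, k=12):
    """Merge overlapping k-mers into larger elements.

    `starts` holds 0-based start positions of significant k-mers on one
    chromosome. Returns (start, end) tuples with 0-based, inclusive ends.
    Illustrative sketch only; this is not the SiPhy implementation.
    """
    elements = []
    for start in sorted(starts):
        end = start + k - 1  # inclusive end of this k-mer
        if elements and start <= elements[-1][1]:
            # Shares at least one base with the previous element: extend it.
            prev_start, prev_end = elements[-1]
            elements[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: start a new element.
            elements.append((start, end))
    return elements
```

For example, 12-mers starting at 100, 105, and 111 would merge into one element spanning 100 to 122, while a 12-mer starting at 200 would remain a separate element.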

Files (warning: LARGE):

Format: ‘Chromosome Start End Lods-score Branch-length’
Format note: The coordinates are 0-based, inclusive (meaning the End position is considered part of the element), and on hg18
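Because the End position here is inclusive, these coordinates differ from the BED convention (0-based start, exclusive end) by one at the End column. The following is a minimal conversion sketch, assuming the fields are whitespace-delimited; the function name `element_to_bed` and the sample values in the usage note are made up for illustration.

```python
def element_to_bed(line):
    """Convert one 'Chromosome Start End Lods-score Branch-length' line
    (0-based, inclusive End) to BED-style coordinates (0-based, exclusive End).

    Assumes whitespace-delimited fields; illustrative sketch only.
    """
    chrom, start, end, lods, branch_length = line.split()
    # Inclusive End -> exclusive End: add 1. Start is already 0-based.
    return (chrom, int(start), int(end) + 1, float(lods), float(branch_length))
```

With a made-up line such as `chr1 1000 1011 25.3 4.2`, the element covers 12 bases (1011 - 1000 + 1), and the BED-style interval becomes 1000 to 1012.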

Contact: Or Zuk, Manuel Garber

Reference: Garber, M. et al., Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 25, i54-62, doi:btp190 [pii] 10.1093/bioinformatics/btp190 (2009).

Heights (omega, pi)
Summary: These files contain a base-level measure of constraint scored using SiPhy (see reference above), both as a rate-based score (omega) and as a measure that includes biased substitution patterns (pi).

Files (warning, VERY LARGE):

Format: ‘position log_odds_score’

Format: ‘position %A %C %G %T log_odds_score’

Format note: coordinates on hg18
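The two line formats above differ only in the four per-base percentage columns, so a single reader can handle both. The sketch below assumes whitespace-delimited fields, an integer position, and that a line has either two or six columns. The function name `parse_height_line` is hypothetical.

```python
def parse_height_line(line):
    """Parse one base-level score line in either the two-column format
    (position, score) or the six-column format (position, four base
    percentages, score).

    Returns (position, base_percentages, log_odds_score), where
    base_percentages is None for the two-column format.
    Assumes whitespace-delimited fields; illustrative sketch only.
    """
    fields = line.split()
    if len(fields) == 2:
        position, score = fields
        base_percentages = None
    elif len(fields) == 6:
        position, score = fields[0], fields[5]
        # Four percentage columns, kept in file order.
        base_percentages = [float(x) for x in fields[1:5]]
    else:
        raise ValueError(f"unexpected number of columns: {len(fields)}")
    return int(position), base_percentages, float(score)
```

Since the files are very large, it is likely preferable to stream them line by line through such a function rather than load them into memory.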

Contact: Manuel Garber

Protein-coding Exons
Summary: A list of identified exons, comprising previously annotated exons (Reference annotation) and novel conserved exons (Congo). Exons were identified using a version of CONGO (previously developed for the Drosophila genomes, see reference below) enhanced to handle mammalian exon prediction. The enhancements include a semi-Markov feature to model the short length distribution of mammalian exons, a synteny feature for recognizing duplicated regions, and an alternative training function to improve accuracy when performing an unbalanced prediction task (only ~1.5% of the human genome is protein-coding).

File:

Format: GTF
Format note: coordinates on hg18

Contact: Mike Lin

Reference: Lin, M. F. et al. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Res 17, 1823-1836, doi:gr.6679507 [pii] 10.1101/gr.6679507 (2007).

Synonymous Constraint Elements
Summary: Identified coding regions with a very low synonymous substitution rate – indicating additional sequence constraints beyond the amino acid level. The Synonymous Constraint Elements (SCEs) are defined at three different resolutions (9-, 15-, and 30-codon). There is also a bedGraph track for the local estimate of the synonymous substitution rate (lambda_s). Also available at: http://compbio.mit.edu/SCE/

File (warning: LARGE):

Files in archive:

  • SCE9.hg18.bed.gz
  • SCE15.hg18.bed.gz
  • SCE30.hg18.bed.gz
  • lambda_s_ORF.hg18.bedGraph.gz

File formats: BED, bedGraph
Format note: coordinates are on hg18

Contact: Mike Lin

RNA Structures
Summary: The list of candidate predictions for structural RNA families. EvoFold structural predictions were based on a 31-way subset of the genome-wide 44-way multiZ alignment (consisting of 28 of the 29 eutherian mammals, together with opossum, chicken, and tetraodon as outgroups) and clustered into candidate families using the novel EvoFam algorithm. This data, as well as the complete set of structure predictions from the EvoFold screen, can be downloaded in bulk or browsed through a UCSC Genome Browser mirror from the following website:

http://moma.ki.au.dk/~jsp/mammals/

In addition, individual families are listed and annotated in the following reference and its supplement.

File:

Files in archive:

  • Genome_wide_prediction_set
  • Genome_wide_with_paralogs_prediction_set
  • UTR_with_paralogs_prediction_set
  • data_format.txt

Format: described in the file: data_format.txt
Format note: coordinates are hg18

Contacts: Brian Parker, Stefan Washietl

Reference: Parker, B. J. et al. New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes. Genome Research (2011).

Constraint Structure in Promoters
Summary: A list of local maxima identified from the smoothed pi-scores in the core promoters of genes.

File:

Format: ‘Chromosome Start End Score’
Format note: Coordinates are 1-based and exclusive (meaning the End base is not included in the peak position). All positions are at a single base, and are on hg18.

Contact: Evan Mauceli

Motif Instances
Summary: A list of instances of identified regulatory motifs. A motif catalog was built from TRANSFAC, Jaspar, and Protein Binding Microarrays using a method similar to that described in the reference below, with extensions for position frequency matrices. Motif instances were identified genome-wide using an FDR of 60%.

File (warning: LARGE):

Format: Motif-name Chromosome Start End Strand
Format note: coordinates are 1-based, inclusive (meaning the End position is considered part of the element), and on hg18
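Since these coordinates are 1-based and inclusive, converting them to the BED convention (0-based start, exclusive end) means subtracting one from Start and leaving End unchanged. The following is a minimal sketch, assuming whitespace-delimited fields; the names `MotifInstance` and `parse_motif_instance`, and the sample motif name in the usage note, are hypothetical.

```python
from collections import namedtuple

# BED-style record: start is 0-based, end is exclusive.
MotifInstance = namedtuple("MotifInstance", "name chrom start end strand")


def parse_motif_instance(line):
    """Parse one 'Motif-name Chromosome Start End Strand' line
    (1-based, inclusive) into a 0-based, half-open record.

    Assumes whitespace-delimited fields; illustrative sketch only.
    """
    name, chrom, start, end, strand = line.split()
    # 1-based inclusive [start, end] -> 0-based half-open [start - 1, end)
    return MotifInstance(name, chrom, int(start) - 1, int(end), strand)
```

For a made-up line such as `MOTIF_X chr2 101 110 +`, the instance covers 10 bases, and the converted interval runs from 100 to 110.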

Contact: Pouya Kheradpour

Reference: Kheradpour, P., Stark, A., Roy, S. & Kellis, M. Reliable prediction of regulator targets using 12 Drosophila genomes. Genome Res 17, 1919-1931, doi:gr.7090407 [pii] 10.1101/gr.7090407 (2007).

Chromatin Mark Data
Summary: ENCODE segmentation of hg18 into chromatin states for each of nine human cell types. States were learned using a Hidden Markov Model that computationally integrated ChIP-seq data into fifteen states associated with different types of functionality.

This data is available from UCSC at:
http://genome-preview.ucsc.edu/cgi-bin/hgTrackUi?hgsid=2563118&c=chrX&g=wgEncodeBroadHmm

A similar segmentation based on the CD4T cell line is also provided in the file:

Files (warning: LARGE):

Files in archive:

  • wgEncodeBroadHmmGm12878HMM.bed (modelling in GM12878 cells)
  • wgEncodeBroadHmmH1hescHMM.bed (modelling in H1-hESC cells)
  • wgEncodeBroadHmmHmecHMM.bed (modelling in HMEC cells)
  • wgEncodeBroadHmmHsmmHMM.bed (modelling in HSMM cells)
  • wgEncodeBroadHmmHuvecHMM.bed (modelling in HUVEC cells)
  • wgEncodeBroadHmmHepg2HMM.bed (modelling in HepG2 cells)
  • wgEncodeBroadHmmNhekHMM.bed (modelling in NHEK cells)
  • wgEncodeBroadHmmK562HMM.bed (modelling in K562 cells)
  • wgEncodeBroadHmmNhlfHMM.bed (modelling in NHLF cells)

File format: BED

Contact: Jason Ernst

References: Ernst J and Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology 2010 Jul 25;28:817-825.

Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M and Bernstein BE. Systematic analysis of chromatin state dynamics in nine human cell types. accepted.
 

Accounting for Conserved Elements
Summary:  A list of each conserved element (omega), the chromatin state it resides in, and, if applicable, any genic annotation it overlaps, as well as any overlapping motif instances.

File (warning: LARGE):

File format:

Column 1: 'Element:'
Column 3-4*: 'chromatin:' 'cell-type:chromatin-state-' for 9 cell types. *If a conserved element overlaps multiple chromatin states, then the consecutive states appear in consecutive columns.
Column 5-6 (if applicable): 'gencode:' annotation
Column 7-8 (if applicable): 'motifs:' overlapping motifs for this element

Chromatin state numbers and candidate annotations:
State 1 – Active Promoter
State 2 – Weak Promoter
State 3 – Inactive/Poised Promoter
State 4,5 – Strong enhancer
State 6,7 – Weak enhancer
State 8 – Insulator
State 9 – Transcriptional transition
State 10 – Transcriptional elongation
State 11 – Weak transcribed
State 12 – Polycomb-repressed
State 13 – Heterochromatin; low signal
State 14,15 – Repetitive/Copy Number Variation
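
When processing the file programmatically, the state numbers listed above can be kept in a small lookup table. This is a minimal sketch that assumes the state number has already been extracted from the 'cell-type:chromatin-state' field; the names `CHROMATIN_STATES` and `annotate_state` are hypothetical.

```python
# Candidate annotation for each of the fifteen chromatin states listed above.
CHROMATIN_STATES = {
    1: "Active Promoter",
    2: "Weak Promoter",
    3: "Inactive/Poised Promoter",
    4: "Strong enhancer",
    5: "Strong enhancer",
    6: "Weak enhancer",
    7: "Weak enhancer",
    8: "Insulator",
    9: "Transcriptional transition",
    10: "Transcriptional elongation",
    11: "Weak transcribed",
    12: "Polycomb-repressed",
    13: "Heterochromatin; low signal",
    14: "Repetitive/Copy Number Variation",
    15: "Repetitive/Copy Number Variation",
}


def annotate_state(state):
    """Return the candidate annotation for a state number (int or numeric str).

    Raises KeyError for numbers outside 1-15. Illustrative sketch only.
    """
    return CHROMATIN_STATES[int(state)]
```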

Contact: Evan Mauceli

Associated GWAS SNPs overlapping constraint
Summary: Overlap between genomic variants associated with clinical phenotypes and mammalian conservation. GWAS data was downloaded from the NHGRI database on 5/30/11.

File:

File format: Microsoft Excel Workbook

Contact: Luke Ward
 

Positively Selected Codons
Summary: This archive contains the main data files and backing data for the analysis identifying positively selected codons. This data and updates are available for download from here:
http://www.ebi.ac.uk/~greg/mammals/

File (warning: VERY LARGE):

Files in archive - please see the README file for additional file descriptions:

  • Pol_sel_score.bed
  • Overall_dN_dS.bed
  • Sites.bedGraph
  • mammals_e57_sitewise_tables.Rdata
  • web/ (directory)

File formats: BED, bedGraph, Rdata
Format note: coordinates are on hg18

Contact: Gregory Jordan

Exapted Repeats
Summary: A list of exapted elements identified as described in the following reference.

File (warning: LARGE):

File format: BED
Format note: coordinates are on hg18
Contact: Craig Lowe

Reference: Lowe, C. B. & Haussler, D. 29 mammalian genomes reveal novel exaptations of mobile elements for likely regulatory functions in the human genome. In preparation (2011).

Human and Primate Accelerated Regions
Summary: Lists of human accelerated regions (HARs) and primate accelerated regions (PARs). Regions with accelerated substitution rates in either lineage were identified by first defining candidate elements using the phastCons program (not including the lineage of interest) and then scoring those elements for accelerated substitution rates in the subtree (human or primate) of interest.

Files:

Format: BED
Format note: coordinates are on hg18

Contact: Katherine Pollard