This website allows you to browse known and discovered motifs for the ENCODE TF ChIP-seq datasets.
Citation: Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments
Pouya Kheradpour and Manolis Kellis
Nucleic Acids Research, 2013 December 13, doi:10.1093/nar/gkt1249
Send any questions/comments to Pouya Kheradpour.
- Each experiment is placed into a "factor group" based on the ChIP TF and its known motifs, with the intention of grouping factors that have very similar motifs. Known motifs are assigned to factor groups using the same criteria.
- Navigation is done in the left frame. For each factor group, it indicates the number of known motifs, discovered motifs, and experimental datasets. Clicking on a header re-sorts the table.
- The "discovered motifs" for each factor group are the top 10 (in terms of enrichment in their discovery dataset using the Intergenic background) where no two are more than 0.75 similar to each other (this prevents very similar variants of the same motif from being taken).
- Enrichments are computed by taking the fraction of motif instances that are inside the bound regions and dividing it by the fraction of shuffled control motif instances inside those regions (where the bound regions are filtered against the background regions, defined below). They are also corrected for small counts by using a confidence interval (with Z=1.5) around each fraction and taking the extreme that leads to the enrichment closest to 1.
- Clicking on a factor group will change the middle and right frames.
- The middle frame shows the known and discovered motifs. Clicking on the name of the motif will highlight it in the heatmap. Clicking on the logo will provide the PFM (position frequency matrix). For all PFMs, see motifs.txt below.
- The right frame is a heatmap indicating:
- Top; in white/black color scale: The similarity (in correlation) between the known/discovered motifs. This is computed directly from the PFMs without using the genome at all.
- Below; in white/red scale: the enrichment of each of the motifs. Enrichments are not available for motifs with too little information content or for which control motifs could not be created.
Enrichments are shown for three different background regions as three triangles: all intergenic/intronic (top), only within +/-2kb of a TSS (left), and only outside +/-2kb of a TSS (right). All three backgrounds exclude coding regions, 3'UTRs, and repetitive regions. The indicated number is for all Intergenic.
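The enrichment definition above can be sketched in a few lines of Python. This is only an illustration, not the code used for the website: the type of confidence interval is not stated here, so the sketch assumes a Wilson score interval, and it assumes that "closest to 1" means the corrected value never crosses 1. The function and argument names are hypothetical; the four counts correspond to the motif and control counts in the background and foreground.

```python
import math


def wilson_interval(k, n, z=1.5):
    """Wilson score interval for the proportion k/n (assumed interval type)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def corrected_enrichment(motif_fg, motif_bg, ctrl_fg, ctrl_bg, z=1.5):
    """Ratio of motif to control fractions, pulled toward 1 for small counts.

    motif_fg / motif_bg: motif instances inside bound regions / in the background.
    ctrl_fg / ctrl_bg:   the same counts for the shuffled control motifs.
    """
    p_motif = motif_fg / motif_bg
    p_ctrl = ctrl_fg / ctrl_bg
    m_lo, m_hi = wilson_interval(motif_fg, motif_bg, z)
    c_lo, c_hi = wilson_interval(ctrl_fg, ctrl_bg, z)
    if p_motif > p_ctrl:
        # Apparent enrichment: use the least favorable ends, but not below 1.
        return max(1.0, m_lo / c_hi)
    if p_motif < p_ctrl:
        # Apparent depletion: use the least favorable ends, but not above 1.
        return min(1.0, m_hi / c_lo)
    return 1.0


print(corrected_enrichment(500, 10000, 100, 10000))  # large counts, stays well above 1
print(corrected_enrichment(2, 10, 1, 10))            # tiny counts, collapses to 1.0
```

With large counts the corrected value stays close to the raw ratio, while with very small counts the intervals overlap and the value falls back to 1.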
Experiment names are assigned systematically using a name mapping scheme.
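The selection rule for discovered motifs described earlier (top 10 by enrichment, with no two more than 0.75 similar to each other) could be implemented as a greedy filter. The sketch below is one plausible reading, not necessarily the procedure actually used: it assumes motifs are visited in decreasing order of enrichment and kept only if they are not too similar to any motif already kept. The function name, the motif names, and the toy scores are all hypothetical.

```python
def select_discovered(motifs, enrichment, similarity, max_keep=10, max_sim=0.75):
    """Greedily keep the most enriched motifs that are mutually dissimilar.

    motifs:     iterable of motif names
    enrichment: dict mapping motif name -> enrichment in its discovery dataset
    similarity: function (name_a, name_b) -> similarity score
    """
    ranked = sorted(motifs, key=lambda m: enrichment[m], reverse=True)
    kept = []
    for motif in ranked:
        # Skip near-duplicates of motifs that were already selected.
        if all(similarity(motif, other) <= max_sim for other in kept):
            kept.append(motif)
            if len(kept) == max_keep:
                break
    return kept


# Toy data (hypothetical names and scores).
toy_enrichment = {"disc1": 5.0, "disc2": 4.0, "disc3": 3.0}
toy_sim = {
    frozenset(("disc1", "disc2")): 0.9,
    frozenset(("disc1", "disc3")): 0.1,
    frozenset(("disc2", "disc3")): 0.2,
}


def toy_similarity(a, b):
    return toy_sim[frozenset((a, b))]


print(select_discovered(toy_enrichment, toy_enrichment, toy_similarity))
```

In the toy data, disc2 is dropped because it is 0.9 similar to the more enriched disc1, so disc1 and disc3 are kept.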
- Motifs are matched to the genome using a p-value of 4^-8 (threshold for each motif computed using TFM-PVALUE). A custom program is used to do the actual matching.
- motif-disc.pdf (13M): A printable version of the web page with logos and heatmaps for each factor group.
- encode-motifs-v1.3.tar.gz (43K): software to (1) compute enrichments and produce heatmaps on custom data and (2) perform unified motif discovery. See README contained within for more information.
- The following bulk datafiles are available:
- matches.txt.gz (962M): the motif matches which can be used for carrying out custom analyses (all coordinates are in hg19).
- Coordinates in this file are 1-indexed and end-inclusive.
- The strand of the motif may not match the logo displayed on this website (which may be flipped to match others in the factor group). See the motifs.txt file below for the strand used to produce these matches.
- matches-with-controls.txt.gz (11G): contains all the matches in matches.txt.gz, plus matches for the shuffled control motifs (indicated with _C#).
- matches-with-controls-0.3.txt.gz (1.3G): motif instances at the 0.3 confidence level, based on conservation in closely related species (Kheradpour, et al. 2007; Lindblad-Toh, et al. 2011).
- back-regions.txt.gz (29M): the background regions used for the analysis. This file is also 1-indexed and end-inclusive.
- motifs-sim.txt.gz (15M): similarities between all pairs of motifs.
- motifs.txt (1.1M): all the known and discovered motifs.
- motifs-toscan.txt.gz (875K): known and discovered motifs plus the control shuffles in log-odds format, with the cut-off following each motif name.
- enrichments.txt.gz (34M): the enrichments of every motif in every dataset. Columns indicate the (1) background, (2) dataset with (3) corresponding factor group, (4) motif, (5) enrichment (as defined above), the count of the motif in the (6) background and (7) foreground, and the count of the control motifs in the (8) background and (9) foreground.
- exp-regions-motifs.txt.gz (123M): for each experimental region (with names as described in the mapping scheme), a semicolon-separated list of matching motifs (in the order they occur on the positive strand).
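As a starting point for custom analyses, the nine columns of enrichments.txt.gz listed above could be read as follows. This is a minimal sketch under assumptions: the field delimiter is not documented here, so the code splits on whitespace, and it parses all numeric columns as floats to be tolerant of non-integer counts. The class, function, and field names are hypothetical, and the example line is made up to show the column order.

```python
import gzip
from typing import Iterator, NamedTuple


class EnrichmentRow(NamedTuple):
    background: str      # column 1
    dataset: str         # column 2
    factor_group: str    # column 3
    motif: str           # column 4
    enrichment: float    # column 5
    motif_bg: float      # column 6: motif count in the background
    motif_fg: float      # column 7: motif count in the foreground
    control_bg: float    # column 8: control motif count in the background
    control_fg: float    # column 9: control motif count in the foreground


def parse_row(line: str) -> EnrichmentRow:
    """Parse one line, assuming nine whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return EnrichmentRow(*fields[:4], *map(float, fields[4:]))


def read_enrichments(path: str) -> Iterator[EnrichmentRow]:
    """Stream rows from the gzipped file without loading it all into memory."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_row(line)


# Made-up example line (not taken from the real file).
row = parse_row("Intergenic DATASET_X GROUP_Y MOTIF_Z 2.5 1000 50 900 20")
print(row.motif, row.enrichment, row.motif_fg / row.motif_bg)
```

Streaming the file row by row keeps memory use low, which matters more for the larger match files than for this one.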
Last modified on 2013-11-01.