This website allows you to browse known and discovered motifs for all the experimental datasets.
Send any questions/comments to Pouya Kheradpour.
Last modified on 2011-11-23.
Previous versions:
201006freeze
201101freeze-run1
- Each experiment is put into a "factor group" on the basis of its ChIP TF and that TF's known motifs, with the intention of grouping factors with very similar motifs. Known motifs are assigned to factor groups using the same criteria.
- Navigation is done in the left frame. For each factor group, it indicates the number of known motifs, discovered motifs, and experimental datasets. Clicking on the headers will re-sort the table.
- This is the README page. The performance page indicates how often the top motif found for a factor group matches (correlation at least 0.75) any known motif assigned to that factor group.
- The "discovered motifs" for each factor group are the top 10 (ranked by enrichment in their discovery dataset using the Intergenic background), chosen so that no two have a similarity above 0.75 to each other (this prevents very similar variants of the same motif from being taken).
- Enrichments are computed by taking the fraction of motif instances that are inside the bound regions and dividing it by the fraction of shuffled-motif instances inside the bound regions (where the bound regions are filtered against the background regions, defined below). They are also corrected for small counts by using a confidence interval (with Z=1.5) around each fraction and taking the extreme that leads to the enrichment closest to 1.
- Clicking on a factor group will change the middle and right frames.
- The middle frame shows the known and discovered motifs. Clicking on the motifs will provide the PFM (position frequency matrix). All the known and discovered motifs are also available in one file.
- The right frame is a heatmap indicating:
- Top; in white/black color scale: The similarity (in correlation) between the known/discovered motifs. This is computed directly from the PFMs without using the genome at all.
- Below; in white/red scale: the enrichment of each of the motifs. Enrichments are not shown for known motifs for which we did not have an experimental dataset (because they were not scanned), for motifs for which controls could not be created, or for motifs with too little information content.
Enrichments are shown for three different background regions, drawn as three triangles: all intergenic/intronic (top), only within +/-2kb of a TSS (left), and outside +/-2kb of a TSS (right). All three backgrounds exclude coding regions, 3'UTRs, and repetitive regions. The indicated number is for all Intergenic.
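As a rough illustration of the small-count correction described above, the following Python sketch computes a corrected enrichment from four counts. This page states only that a confidence interval with Z=1.5 is placed around each fraction. The choice of the Wilson score interval, the clamping of the result at 1, and all function and argument names are assumptions made for illustration, not the actual implementation.

```python
import math


def wilson_interval(successes, total, z=1.5):
    """Wilson score interval for a proportion (interval type is an assumption)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def corrected_enrichment(motif_in, motif_total, shuffle_in, shuffle_total, z=1.5):
    """Ratio of fractions, pulled toward 1 using the interval extremes.

    motif_in / motif_total:     motif instances inside bound regions / all instances
    shuffle_in / shuffle_total: the same counts for the shuffled-motif instances
    """
    motif_frac = motif_in / motif_total
    shuffle_frac = shuffle_in / shuffle_total
    m_lo, m_hi = wilson_interval(motif_in, motif_total, z)
    s_lo, s_hi = wilson_interval(shuffle_in, shuffle_total, z)
    if motif_frac >= shuffle_frac:
        # Apparent enrichment: lowest motif fraction over highest control fraction.
        # Clamping at 1 when the intervals overlap is an assumption.
        return max(1.0, m_lo / s_hi)
    # Apparent depletion: highest motif fraction over lowest control fraction.
    return min(1.0, m_hi / s_lo)
```

With large counts the correction shrinks the raw ratio only slightly, while with a handful of instances (for example 2 of 10 motif instances versus 1 of 10 shuffled instances) the estimate collapses to 1.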
Experiment names are assigned systematically using a name mapping scheme.
- Motifs are matched to the genome at a p-value of 4^-8, determined using TFM-PVALUE. A custom program is used to do the actual matching.
- The motif matches file can be used for carrying out custom analyses (all coordinates are in hg19).
- Lines whose first column ends in _C# should be ignored. They do not match the motif itself, but rather shuffles of the motif (e.g., if you want the matches to Factor_known10, take the rows whose first column exactly matches Factor_known10_8mer).
- The file is 1-indexed and end-inclusive. You can ignore the columns after the strand (they refer to the conservation level, etc., which is not used in any of this analysis).
- The regions file contains the background regions (indicated by + in the first column) and the experimental regions filtered against the background (with names as described in the mapping scheme). This file is also 1-indexed and end-inclusive.
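For custom analyses, a minimal Python sketch of pulling one motif's matches out of the motif matches file might look like the following. It follows the rules above (exact match on the first column, shuffle rows skipped, 1-indexed end-inclusive coordinates), but the column order (name, chromosome, start, end, strand), the whitespace delimiter, the reading of _C# as _C followed by digits, and the function names are all assumptions that should be checked against the actual file.

```python
import re

# Assumption: "_C#" means "_C" followed by one or more digits at the end of the name.
CONTROL_SUFFIX = re.compile(r"_C\d+$")


def is_control(name):
    """True for rows that match a shuffle of a motif rather than the motif itself."""
    return bool(CONTROL_SUFFIX.search(name))


def motif_matches(lines, motif_name):
    """Yield (chrom, start0, end, strand) for rows whose first column is motif_name.

    Input coordinates are 1-indexed and end-inclusive; the output is converted to
    0-based, half-open intervals (BED-style), so only the start changes.
    Columns after the strand are ignored.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, chrom, start, end, strand = fields[:5]
        if name != motif_name or is_control(name):
            continue
        yield chrom, int(start) - 1, int(end), strand
```

To get the matches for Factor_known10, this would be called with an open file handle and the name Factor_known10_8mer. Matching the whole first column, rather than a prefix, also keeps Factor_known1_8mer rows from being picked up by mistake.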