| Orthogroup Repository Documentation |
In addition to assigning genes to orthogroups and providing these assignments between each pair of species, we include other data obtained through sequence-based bioinformatics analysis that are pertinent to comparative studies. These include sequence features such as domains, and multiple sequence alignments for the members of each orthogroup. We also present the gene tree topologies that we infer for each orthogroup, both through our Synergy procedure and by maximum-likelihood estimation from the sequence alignments.
A comprehensive set of links is provided from each orthogroup's dedicated page. Some of these links lead to other orthogroups that may share significant sequence similarity or the same set of domains. Other links take users to external resources that provide rich functional and genomic annotations as well.
We plan to update this website as newer data become available. For this reason, we archived an earlier version of the data, the one used for the analysis in our manuscript entitled "A natural history and evolutionary principles of gene duplication in fungi". Our hope is that by providing such a genome-wide catalog of comparative genomic data, users may learn about gene function and evolution in these fungal species.
Note that for this repository of fungal orthogroups, we seeded our algorithm with the curated orthologs taken from YGOB and the manually curated ortholog inventory between S. pombe and S. cerevisiae created by Valerie Wood at the Sanger Institute. The orthologs for the species not included by these databases were inferred automatically by Synergy.
You can click on the "OG" link to access the orthogroup's page (in this case OG #19). You can also access other pertinent data for that orthogroup, such as the protein sequence alignment or gene trees; the "sequence tree" is based on the protein sequence alignment while the "orthogroup tree" has the topology inferred by Synergy.
You can also search using a species-specific query, which can be faster, or you can search by an orthogroup's index. The search returns any orthogroup containing a gene whose name includes the query as a substring. For example, a search for "YGR" will return all the orthogroups containing genes whose names contain that sequence of characters.
The banner at the top of the page shows the orthogroup's
identifier, in this case OG376. The table on the left portion of the
page displays the number of orthologs present in this orthogroup (19) and
the number of taxa represented in it (13). Below these are
the following links:
Below these links we provide a table of homologous orthogroups that share significant sequence similarity with the sequences in this orthogroup. The table provides individual links to these homologs, if there are any, and the link from the table header will invoke a pop-up window (if JavaScript is enabled) that includes all the corresponding links for users to browse between several related orthogroups.
The right side of the orthogroup pages lists the orthogroup's member genes. The gene names are given in the center column of the table and the species they come from are shown next to them on the left. Some genes have links that take users to relevant genomic data resources, such as SGD, CGD, Genolevures and GeneDB. Users can retrieve pertinent annotations for these genes from these corresponding websites. Genes belonging to the three species predating the Whole Genome Duplication (K. waltii, A. gossypii, and K. lactis) and the post-duplication species (S. cerevisiae, C. glabrata, and S. castellii) have links to the Yeast Gene Order Browser, an excellent resource for inferring homology relations using the shared chromosomal order between orthologs (synteny).
On the right-most portion of this table is a matrix displaying which sequence domains were predicted for each gene. For example, all sequences in OG #376 share one domain (Ribosomal protein L11) and all but the S. mikatae gene smik1292-g2.1 share a second domain (Ribosomal protein L12). The link at the top of this matrix will invoke a pop-up window that displays a table describing each domain, with links to descriptions of these domains as well as links to other orthogroups that contain each domain. Following these links allows users to browse between orthogroups that share similar sequence domains. The headers for the matrix also include links to each domain's InterPro page.
It is possible that some of the orthogroups contain erroneous or missing orthologous relations. By toggling between such similar orthogroups, one can inspect whether some of the genes in an orthogroup should belong among those of another orthogroup, or whether a particular orthogroup contains relations that are spurious. Among the aims in making such browsing between similar orthogroups possible is to make it easier to identify such errors.
The last row of the "Orthogroup homologs" table for "Remote homologs" is currently not in use. The eventual purpose of this feature is to provide links to potentially homologous genes in taxa that were not included in this repository. If you believe this feature would be especially useful to you, please contact us to let us know.
Reconstructed gene tree
This is the gene tree reconstructed by the Synergy
algorithm. Its branch lengths have been recomputed with the
SEMPHY package from the tree-assisted sequence alignment,
keeping the tree topology fixed and optimizing the branch
lengths under the JTT substitution matrix with among-site
rate variation.
Sequence-based gene tree
This is the maximum-likelihood tree generated by SEMPHY using
the naive protein sequence alignment and among-site rate variation.
The gene tree reconstructed by Synergy is likely to imply fewer duplications and losses than a typical gene tree generated from the multiple sequence alignment alone. For example, this is the reconstructed gene tree for Orthogroup #1387, which can be reached by following the "reconstructed gene tree" link on the orthogroup's page.
By inspecting this gene tree and comparing it to the species tree, we can infer a single duplication event at the Whole Genome Duplication, and a subsequent single gene loss in the lineage leading to C. glabrata; the branch lengths suggest that the lineage including this loss has accumulated many more mutations than its paralogous counterpart. In contrast, the sequence-based gene tree would invoke many more duplication and loss events, and the correct orthology and paralogy relations between the orthogroup's members would not be accurately identified from its topology:
The tree files can be downloaded by following the links provided in these windows. The trees are in Newick format, a standard format that can be opened by most phylogeny viewing software, such as TreeView or ATV.
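For quick scripting without a tree viewer, the leaf labels of a Newick string can be pulled out with a regular expression. The sketch below is a minimal illustration: the example tree and its gene names are hypothetical, and it assumes unquoted leaf labels with no spaces. A full phylogenetics library is the safer choice for anything beyond listing leaves.

```python
import re


def leaf_names(newick):
    """Return leaf labels of a Newick string, in left-to-right order.

    A leaf label is taken to be any unquoted token that directly follows
    "(" or ",". Internal node labels (which follow ")") are ignored.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Hypothetical three-leaf tree with branch lengths; not taken from the repository.
example = "((Scer|geneA:0.12,Cgla|geneB:0.30):0.05,Klac|geneC:0.21);"
print(leaf_names(example))
```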
The first column lists the genes in the query genome and the ortholog(s) in the target genome are listed in the following column(s).
Here is an example taken from the orthology assignments
between S. cerevisiae and A. gossypii:
| YDR256C | AGL256W
| YGR088W | NONE
| SPCC757.07c | YDR256C | YGR088W |
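Rows like the ones above can be read with a small helper. This is a sketch under assumptions: the delimiter in the downloadable files is not stated here, so the helper accepts either pipes or whitespace, and it treats the literal `NONE` as "no ortholog in the target genome". The function name `parse_ortholog_row` is hypothetical.

```python
import re


def parse_ortholog_row(line):
    """Split one row into (query_gene, [target_orthologs]).

    Fields may be separated by "|" and/or whitespace (an assumption about
    the file layout). A target of "NONE" yields an empty ortholog list.
    """
    fields = [f for f in re.split(r"[|\s]+", line) if f]
    if not fields:
        raise ValueError("empty row")
    query, targets = fields[0], fields[1:]
    return query, [t for t in targets if t != "NONE"]


for row in ["| YDR256C | AGL256W", "| YGR088W | NONE"]:
    print(parse_ortholog_row(row))
```

Returning an empty list for `NONE` lets downstream code treat zero, one, or several orthologs uniformly.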
The files are formatted in the BLAST -m 8 tabular format. Each column represents the following:
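In the standard BLAST -m 8 layout, each hit is one tab-separated line with twelve fields: query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, and bit score. Assuming the repository's files follow that layout unmodified, a minimal reader might look like the sketch below. The names `BlastHit` and `parse_blast_m8` and the sample line are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class BlastHit(NamedTuple):
    query_id: str
    subject_id: str
    pct_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    e_value: float
    bit_score: float


def parse_blast_m8(lines: Iterable[str]) -> Iterator[BlastHit]:
    """Yield one BlastHit per non-empty, non-comment line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError(f"expected 12 tab-separated fields, got {len(f)}")
        # Fields 3..9 are integers; identity, E-value and bit score are floats.
        yield BlastHit(f[0], f[1], float(f[2]), *map(int, f[3:10]),
                       float(f[10]), float(f[11]))


# Hypothetical hit line; the numbers are invented for illustration.
sample = ["YDR256C\tAGL256W\t62.50\t480\t170\t4\t10\t485\t12\t487\t1e-170\t595"]
for hit in parse_blast_m8(sample):
    print(hit.query_id, hit.subject_id, hit.e_value)
```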
> 9880 10 10 Scer|YHR083W Spar|spar35-g6.1 Smik|smik313-g7.1 Sbay|sbayc570-g2.1 Cgla|CAGL0H02167g Scas|Scas584.12 Kwal|Kwal55.20177 Klac|KLLA0E17171g Agos|ADR303W Dhan|DEHA0F26708g
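A header line such as the one above can be split into its parts with a few lines of Python. The sketch assumes that the three integers are the orthogroup index, the number of genes, and the number of taxa, in that order, and that each member is written as `Species|gene`. The meaning of the two counts is an assumption, since both are 10 in this example, and the function name is hypothetical.

```python
def parse_orthogroup_header(line):
    """Parse a '> id n n Species|gene ...' header into a dict.

    The interpretation of the second and third integers (gene count,
    taxon count) is an assumption, not documented behavior.
    """
    tokens = line.lstrip(">").split()
    og_id, n_genes, n_taxa = (int(t) for t in tokens[:3])
    members = [tuple(t.split("|", 1)) for t in tokens[3:]]
    return {"id": og_id, "n_genes": n_genes, "n_taxa": n_taxa, "members": members}


header = ("> 9880 10 10 Scer|YHR083W Spar|spar35-g6.1 Smik|smik313-g7.1 "
          "Sbay|sbayc570-g2.1 Cgla|CAGL0H02167g Scas|Scas584.12 "
          "Kwal|Kwal55.20177 Klac|KLLA0E17171g Agos|ADR303W Dhan|DEHA0F26708g")
og = parse_orthogroup_header(header)
print(og["id"], len(og["members"]), len({species for species, _ in og["members"]}))
```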
The next two lines show gene trees, both in Newick format. The first is a regular tree whose leaves are the genes in the corresponding orthogroup. The second tree assumes that the orthogroup had a gene present in the last common ancestor of the Ascomycete clade and indicates where gene losses occurred. The losses are indexed to show where along the species tree they occurred.
The next line indicates the most ancient index in the species tree to which the orthogroup can be traced; that is, where its root can be found.
Root: 24 Actual root: 22
The next two lines show the number of genes ("Counts") in the orthogroup that are present at each node of the species phylogeny. The indices of the species are available here. The "Actual counts" assumes that the orthogroup is as ancient as its last common ancestral species, while the "Counts" assumes that the orthogroup represents an ancestral gene that was present at the root of the species tree.
Counts: 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Actual Counts: 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
The "Duplications" and "Losses" tracks are similar to the "Counts", showing the number of duplications and losses identified at the corresponding nodes of the species tree.
Duplication: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Losses: 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Actual Losses: 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
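The labeled tracks ("Root", "Counts", "Duplication", "Losses" and their "Actual" variants) all share the shape `Label: integers`, so one regular expression can read them whether each track sits on its own line or several share a line. The sketch below is an illustration only: the function name `parse_tracks` is hypothetical, and the sample record is shortened to five species-tree nodes for readability.

```python
import re

# A label is letters and spaces up to the colon, followed by one or more integers.
TRACK_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*((?:\d+\s*)+)")


def parse_tracks(text):
    """Map each track label to its list of integers."""
    return {
        label.strip(): [int(v) for v in values.split()]
        for label, values in TRACK_RE.findall(text)
    }


# Shortened, illustrative record (five nodes instead of the full species tree).
record = """Root: 24 Actual root: 22
Duplication: 0 0 0 0 0
Losses: 1 1 1 0 0
Actual Losses: 0 0 1 0 0"""

tracks = parse_tracks(record)
print(tracks["Root"], tracks["Actual root"])
print(sum(tracks["Losses"]), sum(tracks["Actual Losses"]))
```

Summing a track gives the total number of events across the species tree, which is a convenient first check when comparing the "Actual" variant with its root-anchored counterpart.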