Supplementary Methods

METHODS

Strains, sequencing and assembly

The sequence and annotation for S. cerevisiae reference strain S288C was obtained from the SGD website ( http://www-genome.stanford.edu/Saccharomyces ) in May 2002. S. paradoxus strain NRRL Y-17217, S. mikatae strain IFO1815, and S. bayanus strain MCYC623 were provided by Ed Louis, Department of Genetics, University of Leicester, Leicester, UK. All species are diploid. The WGS sequence was obtained at the Center for Genome Research. We used paired-end sequencing of 4kb plasmid clones, with lab protocols as described at http://www.broad.mit.edu/. Assemblies were constructed using the Arachne whole genome assembly program. Only scaffolds that contain at least one contig longer than 2kb were kept. We used BlastN to find the best match of every read in the S. cerevisiae genome and inferred a rough correspondence between scaffolds and S.cerevisiae chromosomes. We re-ordered and oriented the scaffolds according to the corresponding chromosomal position on the S. cerevisiae genome. Supplementary information contains all contigs and scaffolds (S1a).

Determining one-to-one orthologous ORFs

For each species, we gathered the complete set of all predicted Open Reading Frames (pORFs) starting with a methionine and at least 50 amino acids long. We compared the protein sequences of these pORFs against the set of annotated protein sequences in S.cerevisiae, using WU-BLAST BlastP with parameters W=4, hitdist=60 and E=10-9. Based on the blast hits, we connected pORFs to S. cerevisiae ORFs in a bipartite graph. We weighed every edge connecting two ORFs by their average amino acid percent identity and the total length aligned in blast hits. We developed a method to automatically separate this bipartite graph into progressively smaller subgraphs until the only remaining matches connect true orthologs. To achieve this separation, we eliminated edges that are sub-optimal in a series of steps. As a pre-processing step, we eliminated all edges that are less than 80% of the maximum-weight edge both in amino acid identity and in length. This step resulted in a large number of unambiguous one-to-one matches for each of the species. Based the unambiguous matches, we built blocks of conserved gene order (synteny) when S. cerevisiae ORFs within 20kb of each other matched proximal pORFs in the assembly. These synteny blocks allowed us to resolve additional unambiguous matches, by preferentially keeping matches within synteny blocks over non-syntenic matches, and marking as paralogs the ORFs that remain orphan. We then extended the synteny blocks based on the newly resolved one-to-one matches, and resolved additional matches based on the extended blocks. We then separated the bipartite graph into its connected components, homology groups of unresolved genes. Finally, we searched for smaller homology groups that can be isolated solely based on the edge weights: we eliminated an edge between two groups when both endpoints had stronger matches to other ORFs within their respective groups. This reduced the size of the remaining homology groups, leaving unresolved mostly subtelomeric gene families for which synteny provided no resolving power. The supplementary information includes the complete correspondence between the S. cerevisiae proteins and pORFs in each of the species (S2a), the blocks of conserved synteny (S2b), and the final homology groups (S2c), dotplots visualizing the gene correspondence between each species and S. cerevisiae (S3a), and a detailed tiling of gene correspondence across the S. cerevisiae chromosomes (S3b and S3c).

Nucleotide-level alignment

We constructed nucleotide-level alignments of unambiguous regions using ClustalW. We created multiple alignments of every S. cerevisiae ORF with the orthologous pORF in each species with one-to-one correspondence. We also aligned orthologous intergenic regions when both flanking ORFs had one-to-one correspondence. When sequence gaps were present in one or multiple species, we constructed the alignment in multiple steps. We first aligned the gapless species creating a base alignment. Then we aligned each portion of a partial species onto the base alignment, and constructed a consensus for each species based on the individually aligned portions. We marked missing sequence between contigs by a dot and disagreeing overlapping contigs by N. Finally, we constructed a multiple alignment of the four species by merging the piece-wise alignments. The alignments are given in S1c and S1d. We counted transitions, transversions, insertions and deletions within these alignments (Table S4a) and used these to estimate the rate of evolutionary change between the species (Figure 3). We counted the rate of synonymous and non-synonymous substitutions for every protein coding gene (Table S4b) to find evidence of positive selection.

Genomic rearrangements

We considered consecutive unambiguous matches, marking changes in gene spacing, gene orientation, and off-synteny matches between scaffolds and orthologous S. cerevisiae chromosomes (Tables S5a). We found that changes in gene spacing are typically associated with transposon insertions and associated novel genes, as well as tandem duplications (Figure 2 and Table S5c). Virtually all changes in gene orientation typically affect between 2 and 10 consecutive ORFs and can be traced to one of 16 multi-gene inversions (Table S5b). The majority of off-synteny matches involve a single ORF and only 20 involve more than 2 consecutive ORFs. Virtually all single-gene off-synteny matches were contained within ancient duplication blocks of Saccharomyces as described in and http://acer.gen.tcd.ie/~khwolfe/yeast/nova/. These probably represent previously duplicated genes that were differentially lost in different species, rather than a DNA break in one of the two lineages, as was previously noted in . Off-synteny matches that involve more than two genes from the same chromosome correspond to one of 20 chromosomal exchanges (Figure 2 and Table S5b).

Clustering of ambiguous genes

We used ORFs with ambiguous correspondence to determine regions of rapid change in S. cerevisiae. We marked the chromosomal location of all S. cerevisiae ORFs that are ambiguous in at least one species. We then constructed ambiguity clusters when two or more ambiguous ORFs within 16kb of each other. We counted the number of ambiguities in each cluster, counting more than one ambiguities for an ORF whose correspondence was ambiguous in more than one species. Only 32 clusters were found containing more than two ambiguities (Table S5d and Figure 2). We ignored two clusters due to regions of low coverage in S. mikatae and one cluster corresponding to a previously described inversion.

Reading Frame Conservation

We evaluated reading frame conservation (RFC) for S. cerevisiae ORFs by first aligning the orthologous region for each species, then counting the percent of nucleotides in frame in the alignment, and finally casting a vote for each species and tallying the votes. For each species, we used BlastN to find a list of similar regions to the S. cerevisiae ORF in question. If the ORF lied in a synteny block, we preferentially selected the best blast hit within that block. If the ORF had no synteny information or no syntenic match, we selected the blast hit with the most nucleotide identities. We then evaluated the frame conservation within the blast alignment. We labeled each nucleotide of the first sequence by their position within a codon, as 1, 2 or 3 in order, starting at codon offset 1. We similarly labeled the nucleotides of the second sequence, but once for every start offset (1, 2, or 3). We then counted the percentage of gapless positions in the alignment that contained the same label in both aligned species, and selected the maximum percentage found in each of the three offsets of the second sequence. The final RFC value for the ORF was calculated by averaging the percentages obtained at overlapping windows of 100 nucleotides starting every 50 nucleotides. We found that the distribution of frame conservation within each species is bimodal, and we chose a simple cutoff for each species, 80% for S.paradoxus, 75% for S.mikatae and 70% for S.bayanus. If the RFC of the best hit was above the cutoff, a species voted for keeping the ORF tested. If the RFC was below the cutoff and the hit was trusted as orthologous, the species voted for rejecting the tested ORF. Finally, if no orthologous hit could be found due to coverage, a species abstained from voting. We calculated a score between -3 and +3 for every ORF based on the number of species that accepted it (+1) and the number of species that rejected it (-1). We kept all ORFs with a score of 1 or greater, and rejected all ORFs with a score of -1 or smaller. We manually inspected the remaining ORFs. We chose not to reject ORFs based on Ka/Ks ratios to avoid rejecting fast evolving ORFs. The final list of decisions for each ORF is shown in Table S6a, and the detailed results for each species are shown in Table S6b. A window-based RFC analysis for ORFs is shown in S6c.

Validating small ORFs

To evaluate the predictive power of the RFC test for small ORFs, we additionally tested for presence of in-frame stop codons in the other species. When a small ORF in S. cerevisiae showed a strong overall frame conservation, we measured the length of the longest ORF in the same orientation in each orthologous locus. We measured the percent of the S. cerevisiae length that was open in each species, and took the minimum of the three percentages (OPEN) across the three additional species. When the reading frame was open in each of the other species, the lengths found were identical to that of S. cerevisiae, and OPEN was 100%. When OPEN was below 80%, we concluded that the RFC test failed to find a true ORF conserved across all four species. Figure S3c shows the distribution of OPEN for different values of RFC. For S. cerevisiae ORFs between 50 and 100 amino acids (aa), selecting for high RFC automatically selects for high OPEN, and we estimate the test has high specificity (Figure S6d panel 1). For ORFs between 30 and 50 aa however, only a small portion of the ORFs with high RFC show a high OPEN (Figure S6d panels 2 and 3), and we conclude that the lack of indels within the small interval considered is not due to selective pressure, but instead lack of evolutionary distance between the species aligned.

Refining ORF boundaries and resequencing

We examined the multiple alignment of unambiguous ORFs to identify discrepancies in the predicted start and stop codons across the four species. We searched for the first in-frame ATG in each species and compared it to the annotated ATG in S. cerevisiae. In the S. cerevisiae start was not conserved, we automatically suggested a changed translation start if a subsequent in-frame ATG was conserved in all species and was the first in-frame ATG in at least one species. Otherwise, we manually inspected the alignment for a conserved ATG. Similarly, we suggested changes in stop codons when a common stop in all other species disagreed with the S. cerevisiae annotation. We manually inspected the alignments to confirm that the suggested start and stop boundary changes agreed with conservation boundaries. We observed a lower overall conservation as well as frame-shifting indels outside the new boundaries. We identified merges of two consecutive S. cerevisiae ORFs, when both showed protein blast hits to a single ORF in at least one other species, and when their lengths added up to the length of the matching ORF. All merges and boundary refinements suggested specific changes to the nucleotide sequence of S. cerevisiae (except 3 changes of translation start that required no change). To validate our predictions, we re-sequenced the sites of predicted sequence discrepancies. We used both forward and reverse reads in two different PCR reactions spanning the site. Supplementary information includes the primers used and the alignments of resulting reads against S. cerevisiae (S7e) and a summary of resequencing results (S7d), alignments of all predicted single-species frame-shift mutations (S7c), alignments of suggested boundary changes (S7b) and a summary of suggested start and stop refinements (S7a).

Novel introns

To identify novel introns, we searched for conserved and proximal splice donor and branch signals and manually inspected the resulting alignments. Having constructed multiple alignments of ORFs and flanking intergenic regions, we searched for conserved splicing signals. We used 10 variants of splice donor signals (6-7bp) and 8 variants of branch site signals (7bp) that are found in experimentally validated S. cerevisiae introns. We searched each species independently but required that orthologous signals appear within 10 bp from each other in the multiple alignment of the region. We also required that branch and donor be no more than 600bp apart, which is the case for 90% of known S. cerevisiae introns. We then inspected the multiple alignment surrounding the conserved signals for three properties: (1) a conserved acceptor signal, [CT]AG, 3 of the branch site (2) high RFC 5 of the donor signal and 3 of the acceptor signal. (3) low RFC within the intron. Roughly half of the conserved donor/branch pairs met our additional requirements. The remaining ones were typically internal to ORFs and could be conserved by chance alone, given the overall high nucleotide conservation in ORFs. Supplementary information contains a summary of intron revisions (S7e) and the alignments of known and predicted introns (S7f).

Motif Conservation Score

To evaluate the Motif Conservation Score (MCS) of a motif m of given length and degeneracy, we compared its conservation ratio to that of random patterns of the same length and degeneracy. We first computed the table F containing the relative frequencies of two-fold and three-fold degenerate bases, given the S. cerevisiae nucleotide frequencies (.32 for A and T, .18 for C and G). For example, W=[AT] (.32*.32) is a more likely two-fold degenerate base than Y=[CT] (.18*.32). We then selected 20 random intergenic loci in S. cerevisiae. For each of these loci, we used the order of nucleotides at that locus together with the order of degeneracy levels in m to construct a random motif. If the first character of m was two-fold degenerate and the first nucleotide at the selected locus was A, we picked a two-fold degenerate base containing A (W, R or M), their relative frequencies dictated by F, and continued for every character of m. We then counted conserved and non-conserved instances of each of the 20 selected patterns and computed r, the log-average of their conservation rates. We then counted the number of conserved and non-conserved intergenic instances of m, and computed the binomial probability p of observing the two counts, given r. We finally reported the MCS of the motif as a z-score corresponding to p, the number of standard deviations away from the mean of a normal distribution that corresponds to tail area p. We found that the majority of known motifs have a MCS of at least 4 standard deviations.

CC1: Intergenic conservation

We searched for mini-motifs that show a significant conservation in intergenic regions. For every mini-motif, we counted the number of perfectly conserved intergenic instances in all four species ic, and the total number of intergenic instances in S.cerevisiae i. We found that the two counts seem linearly related for the large majority of patterns (Figure 7a), which can be attributed to a basal level of conservation r given the total evolutionary distance that separates the four species compared. We estimated the ratio r as the log-average of non-outlier instances of ic/i within a control set of all motifs at a given gap size. We then calculated for every motif the binomial probability p of observing ic successes out of i trials, given parameter r. We assigned a z-score S to every motif corresponding to probability p. This score is positive if the motif is conserved more frequently than random, and negative if the motif is diverged more frequently than random. We found that the distribution of scores is symmetric around zero for the vast majority of motifs. The right tail of the distribution however is much further than the left tail, containing 1190 motifs more than 5 sigma away from the mean, as compared to 25 motifs for the left tail. By comparing the two counts, we estimated that 94% of these 1190 motifs are non-random in their conservation enrichment.

CC2: Intergenic-genic conservation

We searched for motifs that are preferentially conserved in intergenic regions, as compared to coding regions. In addition to ic and i (see previous section), we counted the number of conserved coding instances gc, and the number of total coding instances g, for every mini-motif. We observed the ratio of conserved instances that are intergenic a=ic/(ic+gc), and compared it to the total ratio of motif instances that are intergenic b=i/(i+g). Not surprisingly, we found that typically b=25% of all motif instances appeared in intergenic regions, which account for roughly 25% of the yeast genome. Similarly, only a=10% of conserved motif instances appeared in intergenic regions, which reflects the lower conservation of intergenic regions. To correct for this typical depletion in intergenic conservation, we estimated a correction factor f=a/b for mini-motifs of similar GC-content. Then for a given mini-motif, the proportion of all instances found in intergenic regions and the correction for the lower conservation of intergenic regions together gave us r=f*i/(i+g), the expected ratio of conserved intergenic instances for that motif. We evaluated the binomial probability p of of observing at least ic conserved instances in intergenic regions and ic+gc conserved instances overall, given the expected ratio r. As in CC1, we computed a z-score S for every motif and found a distribution centered around zero for the large majority of motifs, and a heavier right tail. We selected 1110 motifs above 5 sigma and estimated that 97% are non-random as compared to only 39 motifs below -5 sigma.

CC3: Upstream-downstream conservation

We searched for motifs that are differentially conserved in upstream regions and downstream regions. We defined upstream-only intergenic regions in divergent promoters that are upstream of both flanking ORFs, and downstream-only intergenic regions in convergent 3 terminators that are downstream of both flanking ORFs. We then counted uc and u, the conserved and total counts in upstream-only regions, and similarly dc and d in downstream-only regions. We found that upstream-only and downstream-only regions have similar conservation rates, and the ratios uc/u and dc/d are both similar to ic/i for the large majority of motifs. We thus used a simple chi-square contingency test on the four counts (uc,u,dc,d) to find motifs that are differentially conserved. We found 1089 mini-motifs with a chi-square value of 10.83 or greater, which corresponds to a p-value of .001. Given the multiple testing of 45760 mini-motifs, we estimated that roughly 46 will show such a score by chance and that 96% of the selected motifs will be non-random.

Constructing full motifs

We extended each mini-motif selected by searching for surrounding bases that are preferentially conserved when the motif is conserved. We used an iterative approach adding at every iteration one base that maximally discriminates the neighborhood of conserved motif instances from the neighborhood of non-conserved motif instances. The added base was selected from fourteen degenerate symbols of the IUB code (A, C, G, T, S, W, R, Y, M, K, B, D, H, V). When no such symbol separated the conserved and non-conserved instances with significance above 3 sigma, we terminated the extension. Figure 7D shows the top-scoring mini-motif found in CC1 (Panel 1), and the corresponding extension (Panel 2). We found that many mini-motifs have the same or similar extensions, and we grouped these based on sequence similarity. We measured the similarity between two motifs as the number of bits in common in the best ungapped alignment of the two motifs, divided by the minimum number of bits contained in either motif. Based on the pairwise motif similarity matrix, we clustered the extended motifs hierarchically, collapsing two groups if the average similarity between their member motifs was at least 70%. We then computed a consensus sequence for every cluster of extended motifs, resulting into a smaller number of mega-motifs for each test (332 for CC1, 269 for CC2 and 285 for CC3). Panel 3 shows the first 9 members of the top cluster in CC1, and the resulting mega-motif. Finally, we merged mega-motifs based on their co-occurrence in the same intergenic regions (Panel 4). We computed a hypergeometric co-occurrence score between the intergenic regions hit by each mega-motif and again collapsed these hierarchically. We computed a consensus for every cluster, and iterated the co-occurrence-based collapsing step (results not shown). We obtained fewer than 200 distinct genome-wide motifs. We found 72 consensus motifs with MCS ≥ 4 (Table 3).

Category-based motif discovery (CC4)

To select mini-motifs enriched in a given category, we counted the conserved instances within the category (IN), and the conserved instances outside the category (OUT). We estimated the ratio IN/(IN+OUT) that we should expect for the category, based on the entire population of mini-motifs. We then calculated the significance of an observed enrichment as the binomial probability of observing IN successes out of IN+OUT trials given the probability of success p. We assigned a z-score to each mini-motif, as described in the genome-wide search, and similarly extended and collapsed the significant mini-motifs.

Additional References for Methods

S1. Cherry, J. M. et al. SGD: Saccharomyces Genome Database. Nucleic Acids Res 26, 73-9 (1998).

S2. Oda, Y., Yabuki, M., Tonomura, K. & Fukunaga, M. A phylogenetic analysis of Saccharomyces species by the sequence of 18S-28S rRNA spacer regions. Yeast 13, 1243-50 (1997).

S3. Naumov, G. I., James, S. A., Naumova, E. S., Louis, E. J. & Roberts, I. N. Three new species in the Saccharomyces sensu stricto complex: Saccharomyces cariocanus, Saccharomyces kudriavzevii and Saccharomyces mikatae. Int J Syst Evol Microbiol 50 Pt 5, 1931-42 (2000).

S4. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res 12, 177-89 (2002).

S5. Jaffe, D. B. et al. Whole-genome sequence assembly for Mammalian genomes: arachne 2. Genome Res 13, 91-6 (2003).

S6. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403-10 (1990).

S7. Wolfe, K. H. & Shields, D. C. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387, 708-13 (1997).

S8. Fischer, G., Neuveglise, C., Durrens, P., Gaillardin, C. & Dujon, B. Evolution of gene order in the genomes of two related yeast species. Genome Res 11, 2009-19 (2001).

S9. Clark, T. A., Sugnet, C. W. & Ares, M., Jr. Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science 296, 907-10 (2002).