Yeast Comparative Genomics

Broad Institute of MIT and Harvard


We have carried out a comparative analysis of four closely related Saccharomyces yeast species. We sequenced the complete genomes of S. paradoxus, S. mikatae and S. bayanus, and compared these to S. cerevisiae, commonly known as baker's yeast.

Our results include a major revision of the yeast genome affecting 15% of all genes, the discovery of a complete dictionary of strongly conserved regulatory motifs, and the elucidation of regions and mechanisms of rapid genomic change.

The results of the analysis have been published in:

Sequencing and Comparison of Yeast Species to Identify Genes and Regulatory Elements
Manolis Kellis (Kamvysselis), Nick Patterson, Matthew Endrizzi, Bruce Birren, Eric S. Lander

Nature 423, 241 - 254 (2003); doi:10.1038/nature01644

Correspondence to M.K. and E.S.L.

Who are we?

The Eli and Edythe L. Broad Institute is a partnership among MIT, Harvard and affiliated hospitals and the Whitehead Institute for Biomedical Research. Its mission is to create the tools for genomic medicine and make them freely available to the world and to pioneer their application to the study and treatment of disease.

The Broad Institute has also developed large-scale computational tools for the use and analysis of high-throughput sequence data. These include whole-genome shotgun assembly, cancer classification, SNP discovery and creation of haplotype maps, and comparative genome analysis.

Goals of this project

With the completion of the human and mouse genomes and the increasing number of mammalian genomes in the sequencing pipelines of centers around the world, comparative genomics holds great promise for understanding complete genomes. The development of new algorithmic and statistical methods for comparative genome analysis will be crucial for interpreting genomic information.

We chose to apply these ideas to the yeast Saccharomyces cerevisiae, commonly known as baker's yeast. S. cerevisiae is the best-studied eukaryotic organism, allowing us to validate our findings against a wealth of experimental data. Moreover, the small genome size of yeast allowed us to obtain the sequence of multiple relatives at a moderate cost. Finally, the strong community of yeast researchers and a close collaboration with the Saccharomyces Genome Database (SGD) group made these data accessible to researchers around the world almost a year before publication.

We sequenced the complete genomes of S. paradoxus, S. mikatae and S. bayanus, using the whole-genome shotgun sequencing method. We obtained on average 7-fold redundant coverage using paired-end reads and assembled the data using the Arachne computer program, which was also developed at the Broad Institute and is publicly available. The resulting assemblies showed long-range continuity and very small gaps. Additionally, the closely related S. cerevisiae provided confirmation of the quality of our assemblies.
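
To make the "7-fold redundant coverage" figure concrete, the following minimal Python sketch computes fold coverage as reads times read length divided by genome size, and uses the standard Poisson approximation to estimate the fraction of bases that no read covers. The read count, read length and genome size below are illustrative assumptions, not the values used in the project.

```python
import math


def fold_coverage(num_reads, mean_read_length, genome_size):
    """Average number of times each base is sequenced: c = N * L / G."""
    return num_reads * mean_read_length / genome_size


def expected_unsequenced_fraction(coverage):
    """Poisson approximation: probability that a base is covered by zero reads."""
    return math.exp(-coverage)


# Hypothetical numbers, chosen only so that the arithmetic gives 7-fold coverage.
genome_size = 12_000_000   # roughly the size of a Saccharomyces genome, in bases
mean_read_length = 600     # assumed usable bases per read
num_reads = 140_000        # assumed number of reads

coverage = fold_coverage(num_reads, mean_read_length, genome_size)
missed = expected_unsequenced_fraction(coverage)
print(f"coverage: {coverage:.1f}x, expected unsequenced fraction: {missed:.5f}")
```

Under this approximation, 7-fold coverage leaves less than 0.1% of bases unsequenced, which is consistent with assemblies that have only very small gaps. Real shotgun data deviate from the Poisson model because of cloning bias and repeats.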


We aligned the genomes automatically, developing algorithms that use the conservation of gene order (synteny) to resolve the correspondence of orthologous genes despite the large number of duplicated genes in the yeast genome. The vast majority of genes have unambiguous orthologous matches, and synteny blocks cover more than 90% of the genome, allowing us to construct a genome-wide multiple alignment across the four species.
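
The idea of using gene order to choose among several similar candidates can be illustrated with a small Python sketch. It is a simplified stand-in, not the algorithm used in the analysis: genes with a single candidate match act as anchors, and an ambiguous gene is assigned to the candidate that lies closest to the matches of its neighbours. The function name, the `window` parameter and the toy gene names are all hypothetical.

```python
def resolve_orthologs(order_a, order_b, candidates, window=2):
    """Pick one match per gene of species A, using gene order to break ties.

    order_a, order_b: gene names in chromosomal order for species A and B.
    candidates: dict mapping each gene in A to the set of similar genes in B.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}

    # Genes with exactly one candidate are unambiguous anchors.
    orthologs = {a: next(iter(c)) for a, c in candidates.items() if len(c) == 1}

    for i, gene in enumerate(order_a):
        cands = candidates.get(gene, set())
        if len(cands) <= 1:
            continue

        # Positions in B of the already matched neighbours of this gene in A.
        neighbours = order_a[max(0, i - window):i] + order_a[i + 1:i + 1 + window]
        anchor_pos = [pos_b[orthologs[n]] for n in neighbours if n in orthologs]

        def support(candidate):
            # Number of neighbouring anchors that sit close to this candidate in B.
            return sum(1 for p in anchor_pos if abs(pos_b[candidate] - p) <= window)

        ranked = sorted(cands, key=support, reverse=True)
        # Accept only a strict winner with some syntenic support.
        if support(ranked[0]) > 0 and support(ranked[0]) > support(ranked[1]):
            orthologs[gene] = ranked[0]

    return orthologs


# Toy example: a3 is similar to both b3 and a distant duplicate, b3dup.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5", "x", "y", "b3dup"]
candidates = {
    "a1": {"b1"}, "a2": {"b2"}, "a3": {"b3", "b3dup"},
    "a4": {"b4"}, "a5": {"b5"},
}
result = resolve_orthologs(order_a, order_b, candidates)
print(result["a3"])
```

In the toy example the flanking genes of a3 all match genes next to b3, so b3 is chosen over the distant duplicate. Genes without a strict winner are left unassigned rather than guessed.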

We used these alignments to identify protein-coding genes. We showed that more than 500 currently annotated genes are not real, and discovered 43 novel genes. We propose a revised gene catalogue containing 5700 genes, with refined gene boundaries for more than 300 genes, 40 novel introns, and 33 merges of consecutive ORFs. We also identified sequencing errors in S. cerevisiae and propose corrections by resequencing.
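
One way to picture how an alignment can separate real genes from spurious ORFs is a reading-frame check: insertions and deletions in a functional gene tend to come in multiples of three, so the reading frame is preserved across species. The Python sketch below is a simplified illustration under that assumption, not the published reading frame conservation test; the function name and the toy sequences are hypothetical.

```python
def reading_frame_conservation(aligned_ref, aligned_other):
    """Fraction of aligned bases that share the most common frame offset.

    Both arguments are gapped sequences of equal length, with '-' for gaps.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")

    counts = [0, 0, 0]       # aligned bases seen at each frame offset (mod 3)
    i = j = 0                # ungapped positions in ref and other
    for r, o in zip(aligned_ref, aligned_other):
        if r != "-" and o != "-":
            counts[(i - j) % 3] += 1
        if r != "-":
            i += 1
        if o != "-":
            j += 1

    total = sum(counts)
    return max(counts) / total if total else 0.0


# A 3-base deletion keeps the frame; a 1-base deletion shifts it.
in_frame = reading_frame_conservation("ATGGCTAAAGGT", "ATGGCT---GGT")
shifted = reading_frame_conservation("ATGGCTAAAGGT", "ATGGCT-AAGGT")
print(in_frame, shifted)
```

A score near 1 across all species is what one would expect of a real gene, while an ORF that occurs by chance accumulates frame-shifting indels and scores lower. A practical test would also need a threshold and a way to combine the evidence from several species, both of which are left out here.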

We also used the comparisons to discover regulatory motifs, the control signals used to turn genes on or off. These are typically small (6-8 nucleotides) and found at variable distances upstream of genes. In a single genome, regulatory motifs are indistinguishable from non-functional nucleotide stretches of similar length. However, in the comparison of multiple genomes, these signals become apparent by their stronger genome-wide conservation. We systematically discovered all strongly conserved nucleotide patterns, and constructed a list of 72 genome-wide motifs. These include most previously known motifs, and a number of new motifs. We assigned candidate functions to the majority of these by their enrichment in functionally related genes. Finally, we showed evidence of combinatorial control of gene regulation, whereby motif combinations change the functional specificity of downstream genes.
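
The contrast between a single genome and a multi-species comparison can be sketched in Python as follows. For every short word in the reference sequence, the code counts how often it occurs and how often that occurrence is identical in every aligned species. This is a deliberately simplified illustration, not the motif conservation score used in the analysis; the function name, the exact-match criterion and the toy alignments are assumptions.

```python
from collections import Counter


def kmer_conservation(alignments, k=6):
    """Return {word: (conserved_count, total_count)} over all alignments.

    alignments: iterable of tuples of equal-length gapped sequences,
    with the reference species first in each tuple.
    """
    total = Counter()
    conserved = Counter()
    for aln in alignments:
        ref = aln[0]
        for start in range(len(ref) - k + 1):
            word = ref[start:start + k]
            if "-" in word:
                continue  # skip windows that overlap an alignment gap
            total[word] += 1
            # Conserved if every other species has the same word in the same columns.
            if all(seq[start:start + k] == word for seq in aln[1:]):
                conserved[word] += 1
    return {word: (conserved[word], total[word]) for word in total}


# Toy upstream alignments of three species; TGACTC is embedded in diverged flanks.
alignments = [
    ("AATGACTCAA", "CGTGACTCGT", "TTTGACTCCC"),
    ("GGTGACTCTT", "ACTGACTCAG", "CATGACTCGA"),
]
stats = kmer_conservation(alignments)
print(stats["TGACTC"], stats["AATGAC"])
```

In the toy data, TGACTC is conserved at both of its occurrences while the overlapping words are not, even though all of them look equally plausible in the reference alone. A genome-wide analysis would compare each word's conservation rate with a background rate and would allow degenerate positions, neither of which this sketch attempts.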

Finally, we identified regions and mechanisms of genomic change. We showed that telomeric regions are rapidly evolving by recombination and duplication events, leading to protein family expansions involved in environmental adaptation. We showed that translocations and inversions are mediated by specific sequences. We identified genes under positive selection for rapid divergence, and genes that have remained identical at the nucleotide level.


See Supplementary Methods for a complete description of strains used and methods for sequencing, assembly, annotation, nucleotide alignments, genomic rearrangements, ambiguous ORFs, reading frame conservation, gene boundaries, resequencing, intron finding, genome-wide motif discovery, motif conservation score, the three mini-motif conservation criteria, motif extension and collapsing, and category-based motif discovery.

Supplemental Data and Information

All sequences, alignments, supplementary figures and tables can be downloaded at the FTP site (see HTML index).

Motif and ORF information is also indexed and searchable.

All supplementary information is linked at: