Candida Database

Candida albicans WO-1

The sequencing and annotation of five Candida species (C. albicans (WO-1), C. tropicalis, L. elongisporus, C. guilliermondii, and C. lusitaniae) were proposed by the Fungal Genome Initiative and funded by the NHGRI. Together with C. albicans (SC5314), sequenced at Stanford and the Biotechnology Research Institute of the National Research Council Canada, and three related species being sequenced at the Wellcome Trust Sanger Institute and at Genoscope, these genomes provide an extraordinary opportunity for the comparative analysis of genes and genomes across the Candida clade, as well as with the sister Saccharomyces clade.


Project Background

Candida species are the most common human fungal pathogens. They cause severe systemic disease in individuals who are immunocompromised, post-surgery, or taking broad-spectrum antibiotics. Although a single species, C. albicans, is responsible for about half of Candida infections, a wide variety of Candida species contribute to the remainder, and the prevalence of these non-albicans infections is increasing (1). The genome sequence of C. albicans (SC5314) (2, 3) represented an enormous advance in the study of Candida, providing the foundation for systematic gene studies. As part of the Fungal Genome Initiative, we have sequenced and annotated five Candida species: C. albicans (WO-1), C. tropicalis (MYA-3404), L. elongisporus (NRRL YB-4239), C. guilliermondii (ATCC6260), and C. lusitaniae (ATCC42720). All clones for these five genomes were constructed here at The Broad Institute. Three additional recently sequenced genomes also fall within this phylogenetic group: C. dubliniensis, C. parapsilosis, and Debaryomyces hansenii. This growing set of Candida genome sequences allows comparisons across a range of evolutionary distances, enabling many different approaches to studying the conservation of genes and regulatory elements, as well as the evolution of these elements and of genomic architecture within Candida species.


This work is taking place within the framework of a community-based analysis project led by Christina Cuomo (Broad Institute), Geraldine Butler (University College Dublin, Ireland), Neil Gow (University of Aberdeen, UK), and Michael Lorenz (University of Texas, Houston Medical School).


SNPs were discovered by comparing the sequencing reads with the reference assembly.

The reads were aligned to the assembly using the BLAT algorithm. The resulting alignments were filtered: reads that did not have a unique placement in the assembly, or that had more than 20% gaps, were rejected.
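
The filtering step can be illustrated with a minimal Python sketch. This is not the pipeline that was actually used. The Alignment record, its field names, and the filter_alignments function are hypothetical, and the gap percentage is assumed here to mean gap bases divided by the total of aligned and gap bases, which the description above does not specify.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one read placement on the assembly."""
    read_id: str
    aligned_bases: int  # matched plus mismatched bases
    gap_bases: int      # bases falling in alignment gaps

    @property
    def gap_fraction(self) -> float:
        # Assumed definition: gap bases over aligned plus gap bases.
        total = self.aligned_bases + self.gap_bases
        return self.gap_bases / total if total else 1.0


def filter_alignments(alignments, max_gap_fraction=0.20):
    """Keep reads with exactly one placement and at most 20% gaps."""
    alignments = list(alignments)
    placements = Counter(a.read_id for a in alignments)
    return [
        a for a in alignments
        if placements[a.read_id] == 1 and a.gap_fraction <= max_gap_fraction
    ]


example = [
    Alignment("r1", 90, 10),   # unique, 10% gaps: kept
    Alignment("r2", 70, 30),   # unique, 30% gaps: rejected
    Alignment("r3", 100, 0),   # placed twice: rejected
    Alignment("r3", 100, 0),
]
kept_ids = [a.read_id for a in filter_alignments(example)]
```

In practice the per-read counts would be derived from the aligner output; the record above only stands in for whatever fields that output provides.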

The Neighborhood Quality Standard (NQS) algorithm was used to identify high-confidence polymorphic sites. SNPs were identified as single-base sequence variants that had a minimum PHRED base quality of 25 at the position of mismatch, a neighborhood base quality of 20, and no mismatches within 5 bp.
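
The NQS criteria above can be sketched as a single check on one aligned read. This is an illustrative sketch only, under stated assumptions: the function name passes_nqs and its parameters are hypothetical, the read is assumed to be aligned to the reference segment without gaps, "neighborhood base quality of 20" is interpreted as every base within 5 bp on either side having a quality of at least 20, and a mismatch too close to a read end to have a full neighborhood is rejected.

```python
def passes_nqs(read, ref, quals, pos, min_q=25, min_nbr_q=20, window=5):
    """Return True if the mismatch at `pos` meets the NQS thresholds.

    read, ref : equal-length strings (gap-free alignment is assumed)
    quals     : PHRED base qualities for `read`
    pos       : 0-based index of the candidate mismatch
    """
    # The site must actually differ from the reference.
    if read[pos] == ref[pos]:
        return False
    lo, hi = pos - window, pos + window
    # Assumption: require a complete neighborhood on both sides.
    if lo < 0 or hi >= len(read):
        return False
    for i in range(lo, hi + 1):
        if i == pos:
            continue
        # Neighbors must be high quality and must match the reference.
        if quals[i] < min_nbr_q or read[i] != ref[i]:
            return False
    # The variant base itself must meet the stricter threshold.
    return quals[pos] >= min_q


ref_seq = "ACGTACGTACGTA"
read_seq = "ACGTACTTACGTA"          # G>T mismatch at index 6
good_quals = [30] * len(read_seq)
is_snp = passes_nqs(read_seq, ref_seq, good_quals, 6)
```

A caller would run this check at every mismatch position of every read that survived the alignment filter and report the positions that pass.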

SNP data can be downloaded here.


  1. K. C. Hazen, Clin Microbiol Rev 8, 462 (Oct, 1995).
  2. B. R. Braun et al., PLoS Genet 1, 36 (Jul, 2005).
  3. T. Jones et al., Proc Natl Acad Sci U S A 101, 7329 (May 11, 2004).