Gene Finding Methods

Overview

This document describes some of the details of the method used to produce the automated gene calls for the Candida genomes. Automated gene calls were produced in a two-step procedure:

  1. Gene locations and structures were predicted as described below for each genome.
  2. Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.

Gene Structure Prediction

The protocol used for Candida gene structure prediction (see below) is different from what we use for other fungal genomes such as N. crassa and M. grisea, and the gene prediction accuracy report we provide for other fungal genomes is not available for Candida. Therefore, this gene set should be considered provisional, since it represents a first pass at gene calling, with comparative data used for only a subset of genes. We expect to refine this set of gene calls in the future to take full advantage of comparisons with the other Candida genomes, to improve the calling of spliced genes, and to minimize the number of overlapping gene calls in the genome.

Genes were predicted by finding all open reading frames and then refining the annotation based on the Candida albicans (SC5314 version 19 ORF) protein set. Proteins smaller than 150 amino acids were removed from the gene set if they did not have a significant BLAST hit to a C. albicans protein (expect < 1e-5 and 60% coverage of the longer protein). Additionally, for gene pairs with overlapping genome coordinates, the shorter gene of the pair was removed from the gene set if it was clearly shorter (less than 80% of the length of the longer gene) and its best BLAST hit to C. albicans was clearly worse than the best hit for the longer gene (score less than 80% of the longer gene's score).
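The two filters described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the Gene record and its field names are hypothetical, "length" is taken to mean protein length, "60% coverage" is read as "at least 60%", and both genes of a pair are assumed to lie on the same sequence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    # Hypothetical record; field names are illustrative, not from the pipeline.
    gene_id: str
    start: int                 # genome coordinates, inclusive
    end: int
    protein_length: int        # amino acids
    best_hit_expect: Optional[float] = None  # best BLAST hit to C. albicans
    best_hit_coverage: float = 0.0           # fraction of the longer protein
    best_hit_score: float = 0.0


def has_significant_hit(gene: Gene) -> bool:
    """Expect < 1e-5 and (assumed: at least) 60% coverage."""
    return (
        gene.best_hit_expect is not None
        and gene.best_hit_expect < 1e-5
        and gene.best_hit_coverage >= 0.60
    )


def keep_small_protein(gene: Gene) -> bool:
    """Proteins under 150 aa are kept only with a significant hit."""
    return gene.protein_length >= 150 or has_significant_hit(gene)


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the two coordinate ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def drop_from_pair(a: Gene, b: Gene) -> Optional[Gene]:
    """Return the gene to remove from an overlapping pair, or None."""
    if not overlaps(a, b):
        return None
    shorter, longer = sorted((a, b), key=lambda g: g.protein_length)
    clearly_shorter = shorter.protein_length < 0.8 * longer.protein_length
    clearly_worse = shorter.best_hit_score < 0.8 * longer.best_hit_score
    return shorter if clearly_shorter and clearly_worse else None
```

Note that both conditions must hold before the shorter gene is dropped, so a short gene with a comparably good hit survives the overlap filter.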

Gene Naming

Genes are assigned names VERY CONSERVATIVELY. Because this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene names, which fall into 3 categories:

  1. NAME, or
    hypothetical protein similar to NAME, or
    conserved hypothetical protein

    Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
    • Top BlastP hit to a known NR protein (complexity filtering off -F F, expect <= 1e-5), with
    • >=80% identity and >= 80% coverage of both the query and subject sequence.

The exact name is assigned as follows:

    • NAME if the homologous protein is from the curated SwissProt gene set (i.e., we trust the gene name), otherwise
    • conserved hypothetical protein if the homologous protein NAME contains a word in the set {hypothetical, homolog, probable, putative, similar to, predicted, unnamed, unknown} (i.e., we do not want to transfer suspect names), otherwise
    • hypothetical protein similar to NAME

In all cases we take the NR protein name and try to filter out the species name, GIs, and extra whitespace.
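The name selection rules above might look roughly like this in Python. This is a sketch under assumptions: the function names are hypothetical, the NR description line is assumed to carry a "gi|number|..." identifier and a trailing "[species]" tag, and the suspect-word check is done by simple case-insensitive substring matching, which may differ from the matching the pipeline actually used.

```python
import re

# Words that mark an NR name as too unreliable to transfer (from the text).
SUSPECT_WORDS = (
    "hypothetical", "homolog", "probable", "putative",
    "similar to", "predicted", "unnamed", "unknown",
)


def clean_nr_name(defline: str) -> str:
    """Strip GI identifiers, a trailing [species] tag, and extra whitespace.

    The defline layout handled here is an assumption for illustration.
    """
    name = re.sub(r"gi\|\d+\|\S*", " ", defline)   # e.g. gi|123|ref|XP_1.1|
    name = re.sub(r"\[[^\]]*\]\s*$", " ", name)    # trailing [Genus species]
    return " ".join(name.split())


def assign_name(defline: str, from_swissprot: bool) -> str:
    """Choose among the three name forms for a high-confidence homolog."""
    name = clean_nr_name(defline)
    if from_swissprot:
        return name  # curated name is trusted as-is
    lowered = name.lower()
    if any(word in lowered for word in SUSPECT_WORDS):
        return "conserved hypothetical protein"
    return f"hypothetical protein similar to {name}"
```

Substring matching is deliberately broad here: it also catches variants such as "homologue", which errs on the side of not transferring a name.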

  2. Hypothetical protein
    Assigned to gene predictions that show significant BlastP homology to a protein in NCBI's protein set NR or that have an EST alignment. The criteria for this category are:
    • BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), or
    • EST hit (>=300nt, >=98% identity, >95% coverage) which overlaps gene
  3. Predicted protein
    Assigned to gene predictions that do not have an EST alignment or show significant BlastP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BlastP analysis was performed on the gene set. The criteria for this category are:
    • No BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), and
    • No EST hit (>=300nt, >=98% identity, >95% coverage) which overlaps gene
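Taken together, the criteria for the three categories amount to a single decision function. The Python below is a minimal sketch, assuming hypothetical BlastHit and EstHit records with identity and coverage stored as fractions; the label "high-confidence homolog" is an illustrative stand-in for the first category, whose final name is chosen by the SwissProt and suspect-word rules.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BlastHit:
    # Hypothetical summary of the top BlastP hit against NR.
    expect: float
    identity: float          # fraction, 0.0 to 1.0
    query_coverage: float    # fraction of the predicted protein
    subject_coverage: float  # fraction of the NR protein


@dataclass
class EstHit:
    # Hypothetical summary of one EST alignment.
    length_nt: int
    identity: float
    coverage: float
    overlaps_gene: bool


def blast_significant(hit: Optional[BlastHit]) -> bool:
    return hit is not None and hit.expect <= 1e-5


def est_supports(est: EstHit) -> bool:
    """>=300 nt, >=98% identity, >95% coverage, and overlapping the gene."""
    return (
        est.overlaps_gene
        and est.length_nt >= 300
        and est.identity >= 0.98
        and est.coverage > 0.95
    )


def name_category(top_hit: Optional[BlastHit],
                  est_hits: Iterable[EstHit] = ()) -> str:
    if (
        blast_significant(top_hit)
        and top_hit.identity >= 0.80
        and top_hit.query_coverage >= 0.80
        and top_hit.subject_coverage >= 0.80
    ):
        return "high-confidence homolog"  # illustrative label, see lead-in
    if blast_significant(top_hit) or any(est_supports(e) for e in est_hits):
        return "hypothetical protein"
    return "predicted protein"
```

The checks run from strictest to loosest, so a gene falls into "predicted protein" only when neither BlastP nor EST evidence meets the thresholds.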

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form XXXG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using the locus.

With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as XXXG_#####.version). When genes are mapped from one assembly to another or when we release a new set of gene calls, we will increment this version. All the loci in a particular release will have the same version number so that we can ensure consistency.

The locus prefixes for these genomes are:

C. albicans CABG
C. tropicalis CTRG
C. guilliermondii PGUG
C. lusitaniae CLUG
L. elongisporus LELG
C. dubliniensis CDUG
C. parapsilosis CPAG
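
For readers handling these identifiers in scripts, the following Python sketch parses a locus of the form XXXG_#####.version and compares two loci while ignoring the version. It assumes that "#####" means exactly five digits and that the version is a non-negative integer, neither of which is stated explicitly above; the function names and the example identifiers in the comments are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Prefixes listed in the table above.
KNOWN_PREFIXES = {"CABG", "CTRG", "PGUG", "CLUG", "LELG", "CDUG", "CPAG"}

# Assumed layout: prefix, underscore, five digits, optional ".version".
LOCUS_RE = re.compile(r"^([A-Z]{3}G)_(\d{5})(?:\.(\d+))?$")


class Locus(NamedTuple):
    prefix: str
    number: str              # kept as text so leading zeros survive
    version: Optional[int]   # None when no ".version" suffix is present


def parse_locus(text: str) -> Locus:
    """Split an identifier such as 'CTRG_00042.3' (made-up example)."""
    match = LOCUS_RE.match(text.strip())
    if match is None or match.group(1) not in KNOWN_PREFIXES:
        raise ValueError(f"not a recognized locus: {text!r}")
    version = int(match.group(3)) if match.group(3) is not None else None
    return Locus(match.group(1), match.group(2), version)


def same_gene(a: str, b: str) -> bool:
    """Two identifiers name the same gene if they agree ignoring version."""
    pa, pb = parse_locus(a), parse_locus(b)
    return (pa.prefix, pa.number) == (pb.prefix, pb.number)
```

The number is treated as an opaque string rather than an integer, in keeping with the point that loci carry no order or internal structure.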