Gene Finding Methods
Overview
This document explains how automated gene calls were produced for the Vibrio cholerae genome. The annotation was created in two steps:
- Final gene structures were predicted by combining ORFs from Glimmer and GeneMark with manual annotation. This process is described in the section Gene Structure Prediction.
- Gene "names" were assigned to the predicted gene structures based on homology to previously annotated genes. This process is described in the section Gene Naming.
Gene Naming
The gene names for the annotated Vibrio cholerae ORFs were assigned using the following procedure:
We used the top three blastx hits as evidence to name our final gene set. All available evidence was reviewed manually to choose the most informative gene product name and to resolve any discrepancy among the names derived from two or more top hits.
Blast names such as "X Homolog", "X Family", "Similar to X protein", etc. were changed to begin with the prefix "hypothetical protein similar to X". Finally, genes for which none of the above steps yielded a name were named "hypothetical protein".
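The renaming rule above can be illustrated with a minimal Python sketch. This is not the script used for the annotation: the function name normalize_product_name, the regular expressions and the example hit names are all assumptions made for illustration, and the manual review step is not modeled.

```python
import re

# Hypothetical patterns for the three name forms quoted in the text.
# The exact matching rules used for the annotation are not documented here.
_PATTERNS = [
    re.compile(r"^similar to (?P<x>.+?)(?: protein)?$", re.IGNORECASE),
    re.compile(r"^(?P<x>.+?) homolog$", re.IGNORECASE),
    re.compile(r"^(?P<x>.+?) family$", re.IGNORECASE),
]


def normalize_product_name(hit_name):
    """Return a gene product name derived from a blast hit name."""
    # No usable hit name: fall back to the generic name.
    if hit_name is None or not hit_name.strip():
        return "hypothetical protein"
    name = hit_name.strip()
    for pattern in _PATTERNS:
        match = pattern.match(name)
        if match:
            # Weak-homology wording is rewritten with the common prefix.
            return "hypothetical protein similar to " + match.group("x")
    # Any other name is kept as reported by the hit.
    return name
```

In this sketch a hit named "RecA homolog" would become "hypothetical protein similar to RecA", while a name without any of the listed forms is passed through unchanged for manual review.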
Gene Structure Prediction
Gene structures were predicted by combining gene predictions from Glimmer and GeneMark with manually annotated ORFs, using our evidence-based prokaryotic gene calling method. Regions known to contain tRNAs and rRNAs were excluded from consideration on that strand. In loci with no blast coverage, an ab initio predicted ORF was chosen only when both Glimmer and GeneMark predicted a gene with a common stop codon on that strand; in this case, the longer prediction was used. Adjacent genes were allowed to overlap by at most 200 bases.
Automated gene models whose ORF length differed significantly from the available blast evidence were manually reviewed, and gene boundaries were merged, split or refined by manual editing where necessary. We also reviewed all intergenic regions longer than 1.0 kb that contained good blast evidence and manually created new ORFs where the evidence was sufficient. Problematic annotations containing recognizable sequence gaps, errors and frame shifts were flagged with appropriate curation flags.
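The ab initio consensus rule and the overlap limit can be sketched in Python as follows. This is only an illustration under stated assumptions: the Orf class, the function names and the coordinates are hypothetical, the blast coverage test and the tRNA/rRNA exclusion are not modeled, and the text does not say how overlaps beyond the limit were resolved, so the sketch only checks the limit.

```python
from dataclasses import dataclass

MAX_OVERLAP = 200  # upper limit on overlap between adjacent genes, in bases


@dataclass(frozen=True)
class Orf:
    start: int   # leftmost coordinate, 1-based
    end: int     # rightmost coordinate, 1-based, inclusive
    strand: str  # "+" or "-"

    @property
    def stop(self) -> int:
        # The stop codon sits at the right end on "+" and the left end on "-".
        return self.end if self.strand == "+" else self.start

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def consensus_orfs(glimmer, genemark):
    """Keep ORFs whose stop codon and strand are predicted by both tools."""
    genemark_by_stop = {(o.strand, o.stop): o for o in genemark}
    chosen = []
    for g in glimmer:
        m = genemark_by_stop.get((g.strand, g.stop))
        if m is not None:
            # Both tools agree on the stop; keep the longer prediction.
            chosen.append(g if g.length >= m.length else m)
    return sorted(chosen, key=lambda o: o.start)


def overlap(a, b):
    """Number of bases shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def within_overlap_limit(a, b):
    """True if two adjacent ORFs overlap by no more than MAX_OVERLAP bases."""
    return overlap(a, b) <= MAX_OVERLAP


# Hypothetical predictions for a locus with no blast coverage.
glimmer = [Orf(100, 400, "+"), Orf(900, 1200, "-")]
genemark = [Orf(160, 400, "+"), Orf(900, 1100, "-"), Orf(2000, 2300, "+")]
calls = consensus_orfs(glimmer, genemark)
```

Here the third GeneMark prediction is dropped because Glimmer has no prediction with the same stop codon, and for the other two loci the longer of the two agreeing predictions is kept.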
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form XXXX_#####, which should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene, even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
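As a small illustration of this design choice, the Python sketch below treats the locus number as an opaque key and stores position as an ordinary attribute. The GeneRecord class, the contig names and the coordinates are hypothetical, and the placeholder prefix XXXX is copied from the form given above rather than taken from real identifiers.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    contig: str
    start: int
    end: int
    strand: str


# Locus numbers are opaque keys; nothing is inferred from their digits.
genes = {
    "XXXX_00001": GeneRecord("contig_1", 100, 400, "+"),
    "XXXX_00002": GeneRecord("contig_1", 900, 1200, "-"),
}


def position_of(locus, table):
    """Look up the position of a gene by its locus number."""
    record = table[locus]
    return (record.contig, record.start, record.end, record.strand)


# In a new assembly the coordinates may change, but the key does not.
genes["XXXX_00002"] = GeneRecord("contig_7", 50, 350, "+")
new_position = position_of("XXXX_00002", genes)
```

Because position lives in the record and not in the identifier, a gene can move between assemblies without its locus number having to change.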
Structure Prediction Validation
Because no species-specific EST/mRNA data were available, we did not perform our standard structure prediction validation, and hence we do not have a good measure of the accuracy of our gene predictions for this genome.
